[
    {
        "id": "authors:hfsjy-tmv33",
        "collection": "authors",
        "collection_id": "hfsjy-tmv33",
        "cite_using_url": "https://authors.library.caltech.edu/records/hfsjy-tmv33",
        "type": "article",
        "title": "Flexible parsing, interpretation, and editing of technical sequences with splitcode",
        "author": [
            {
                "family_name": "Sullivan",
                "given_name": "Delaney K.",
                "orcid": "0000-0002-8359-6705",
                "clpid": "Sullivan-Delaney-K"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "<div class=\" sec\">\n<div class=\"title\">Motivation</div>\n<p class=\"chapter-para\">Next-generation sequencing libraries are constructed with numerous synthetic constructs such as sequencing adapters, barcodes, and unique molecular identifiers. Such sequences can be essential for interpreting results of sequencing assays, and when they contain information pertinent to an experiment, they must be processed and analyzed.</p>\n</div>\n<div class=\" sec\">\n<div class=\"title\">Results</div>\n<p class=\"chapter-para\">We present a tool called&nbsp;<em>splitcode</em>, that enables flexible and efficient parsing, interpreting, and editing of sequencing reads. This versatile tool facilitates simple, reproducible preprocessing of reads from libraries constructed for a large array of single-cell and bulk sequencing assays.</p>\n</div>\n<div class=\" sec\">\n<div class=\"title\">Availability</div>\n<p class=\"chapter-para\">The&nbsp;<em>splitcode</em>&nbsp;program is free, open source, and available for download at&nbsp;<a class=\"link link-uri openInAnotherWindow\" href=\"http://github.com/pachterlab/splitcode\" rel=\"noopener\">http://github.com/pachterlab/splitcode</a>.</p>\n</div>\n<div class=\" sec\">\n<div class=\"title\">Supplementary information</div>\n<p class=\"chapter-para\"><span class=\"content-section supplementary-material\"><a href=\"https://oup.silverchair-cdn.com/oup/backfile/Content_public/Journal/bioinformatics/PAP/10.1093_bioinformatics_btae331/1/btae331_supplementary_data.zip?Expires=1721759365&amp;Signature=O11BF6Xx3OtqyDRM61VOx0WnGjLmT2HOFIXK5F1~B5nc7gCzA7yONKI7pTKFCUu~zmnY-Y0MvVYpIiYqrNBzAjNYsFU1x1nE6oGig7cc1h-x-mM5afp55VlYaKL32fY1GJNRV31n4m1QenUsXz4jJz4Onmatnu7rmQSvWqX0Y~h~ax8mCKkvZKYZVOY6sqyE4HRfFoIq3dcj~Tf-USEj~Zt-g-VruB-SE8xlztGWHw2zQ8koeyqQTMm5ANjBKv7iQNmCYzGGmacFSFSWrVRfl4vxuK9pWCIkVHqbRY3z4BNIQwW4iiQomczd5pf1D7OxRzGk6v~Ls9Q~GTIxiPSBGQ__&amp;Key-Pair-Id=APKAIE5G5CRDK6RD3PGA\">Supplementary data</a></span>&nbsp;are 
available at&nbsp;<em>Bioinformatics</em> online.</p>\n</div>",
        "doi": "10.1093/bioinformatics/btae331",
        "pmcid": "PMC11193061",
        "issn": "1367-4811",
        "publisher": "Oxford University Press",
        "publication": "Bioinformatics",
        "publication_date": "2024-06-14",
        "series_number": "6",
        "volume": "40",
        "issue": "6",
        "pages": "btae331"
    },
    {
        "id": "authors:svck4-6m833",
        "collection": "authors",
        "collection_id": "svck4-6m833",
        "cite_using_url": "https://authors.library.caltech.edu/records/svck4-6m833",
        "type": "article",
        "title": "PSCA-CAR T cell therapy in metastatic castration-resistant prostate cancer: a phase 1 trial",
        "author": [
            {
                "family_name": "Dorff",
                "given_name": "Tanya B.",
                "orcid": "0000-0001-5990-298X"
            },
            {
                "family_name": "Blanchard",
                "given_name": "M. Suzette"
            },
            {
                "family_name": "Adkins",
                "given_name": "Lauren N."
            },
            {
                "family_name": "Luebbert",
                "given_name": "Laura",
                "orcid": "0000-0003-1379-2927",
                "clpid": "Luebbert-Laura"
            },
            {
                "family_name": "Leggett",
                "given_name": "Neena",
                "orcid": "0000-0002-1644-674X"
            },
            {
                "family_name": "Shishido",
                "given_name": "Stephanie N.",
                "orcid": "0000-0002-5949-0687"
            },
            {
                "family_name": "Macias",
                "given_name": "Alan"
            },
            {
                "family_name": "Del Real",
                "given_name": "Marissa M."
            },
            {
                "family_name": "Dhapola",
                "given_name": "Gaurav"
            },
            {
                "family_name": "Egelston",
                "given_name": "Colt",
                "orcid": "0000-0001-8440-1271"
            },
            {
                "family_name": "Murad",
                "given_name": "John P.",
                "orcid": "0000-0003-0637-2414"
            },
            {
                "family_name": "Rosa",
                "given_name": "Reginaldo",
                "orcid": "0000-0003-0984-8959"
            },
            {
                "family_name": "Paul",
                "given_name": "Jinny",
                "orcid": "0000-0002-6863-6406"
            },
            {
                "family_name": "Chaudhry",
                "given_name": "Ammar",
                "orcid": "0000-0002-2126-0587"
            },
            {
                "family_name": "Martirosyan",
                "given_name": "Hripsime"
            },
            {
                "family_name": "Gerdts",
                "given_name": "Ethan"
            },
            {
                "family_name": "Wagner",
                "given_name": "Jamie R.",
                "orcid": "0000-0001-7961-3253"
            },
            {
                "family_name": "Stiller",
                "given_name": "Tracey"
            },
            {
                "family_name": "Tilakawardane",
                "given_name": "Dileshni"
            },
            {
                "family_name": "Pal",
                "given_name": "Sumanta",
                "orcid": "0000-0002-1712-0848"
            },
            {
                "family_name": "Martinez",
                "given_name": "Catalina"
            },
            {
                "family_name": "Reiter",
                "given_name": "Robert E.",
                "orcid": "0000-0002-7962-3985"
            },
            {
                "family_name": "Budde",
                "given_name": "Lihua E.",
                "orcid": "0000-0003-1464-5494"
            },
            {
                "family_name": "D'Apuzzo",
                "given_name": "Massimo",
                "orcid": "0000-0001-8146-0997"
            },
            {
                "family_name": "Kuhn",
                "given_name": "Peter",
                "orcid": "0000-0003-2629-4505"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "family_name": "Forman",
                "given_name": "Stephen J.",
                "orcid": "0000-0002-2803-4152"
            },
            {
                "family_name": "Priceman",
                "given_name": "Saul J.",
                "orcid": "0000-0002-8136-2112"
            }
        ],
        "abstract": "<div class=\"c-article-section\">\n<div class=\"c-article-section__content\">\n<p>Despite recent therapeutic advances, metastatic castration-resistant prostate cancer (mCRPC) remains lethal. Chimeric antigen receptor (CAR) T cell therapies have demonstrated durable remissions in hematological malignancies. We report results from a phase 1, first-in-human study of prostate stem cell antigen (PSCA)-directed CAR T cells in men with mCRPC. The starting dose level (DL) was 100 million (M) CAR T cells without lymphodepletion (LD), followed by incorporation of LD. The primary end points were safety and dose-limiting toxicities (DLTs). No DLTs were observed at DL1, with a DLT of grade 3 cystitis encountered at DL2, resulting in addition of a new cohort using a reduced LD regimen&thinsp;+&thinsp;100&thinsp;M CAR T cells (DL3). No DLTs were observed in DL3. Cytokine release syndrome of grade 1 or 2 occurred in 5 of 14 treated patients. Prostate-specific antigen declines (&gt;30%) occurred in 4 of 14 patients, as well as radiographic improvements. Dynamic changes indicating activation of peripheral blood endogenous and CAR T cell subsets, TCR repertoire diversity and changes in the tumor immune microenvironment were observed in a subset of patients. Limited persistence of CAR T cells was observed beyond 28 days post-infusion. These results support future clinical studies to optimize dosing and combination strategies to improve durable therapeutic outcomes. ClinicalTrials.gov identifier&nbsp;<a href=\"https://clinicaltrials.gov/study/NCT03873805\">NCT03873805</a>.</p>\n</div>\n</div>\n\n<div class=\"main-content\">\n\n\n<div class=\"c-article-section\"></div>\n\n</div>",
        "doi": "10.1038/s41591-024-02979-8",
        "issn": "1078-8956",
        "publisher": "Nature Publishing Group",
        "publication": "Nature Medicine",
        "publication_date": "2024-06-12"
    },
    {
        "id": "authors:gzq6b-6e424",
        "collection": "authors",
        "collection_id": "gzq6b-6e424",
        "cite_using_url": "https://authors.library.caltech.edu/records/gzq6b-6e424",
        "type": "article",
        "title": "A novel approach to comparative RNA-Seq does not support a conserved set of orthologs underlying animal regeneration",
        "author": [
            {
                "family_name": "Sierra",
                "given_name": "Noemie",
                "orcid": "0000-0003-1329-4733",
                "clpid": "Sierra-Noemie"
            },
            {
                "family_name": "Olsman",
                "given_name": "Noah",
                "orcid": "0000-0002-4351-3880",
                "clpid": "Olsman-Noah"
            },
            {
                "family_name": "Yi",
                "given_name": "Lynn",
                "orcid": "0000-0003-4575-0158",
                "clpid": "Yi-Lynn"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "family_name": "Goentoro",
                "given_name": "Lea",
                "orcid": "0000-0002-3904-0195",
                "clpid": "Goentoro-L-A"
            },
            {
                "family_name": "Gold",
                "given_name": "David A.",
                "orcid": "0000-0003-0135-4022",
                "clpid": "Gold-David-A"
            }
        ],
        "editor": [
            {
                "family_name": "Pisani",
                "given_name": "Davide"
            }
        ],
        "abstract": "<div class=\"copyright copyright-statement\">\n\n\n<p class=\"chapter-para\">Molecular studies of animal regeneration typically focus on conserved genes and signaling pathways that underlie morphogenesis. To date, a holistic analysis of gene expression across animals has not been attempted, as it presents a suite of problems related to differences in experimental design and gene homology. By combining orthology analyses with a novel statistical method for testing gene enrichment across large datasets, we are able to test whether tissue regeneration across animals share transcriptional regulation. We applied this method to a meta-analysis of 6 publicly available RNA-Seq datasets from diverse examples of animal regeneration. We recovered 160 conserved orthologous gene clusters, which are enriched in structural genes as opposed to those regulating morphogenesis. A breakdown of gene presence/absence provides limited support for the conservation of pathways typically implicated in regeneration, such as Wnt signaling and cell pluripotency pathways. Such pathways are only conserved if we permit large amounts of paralog switching through evolution. Overall, our analysis does not support the hypothesis that a shared set of ancestral genes underlie regeneration mechanisms in animals. After applying the same method to heat shock studies and getting similar results, we raise broader questions about the ability of comparative RNA-Seq to reveal conserved gene pathways across deep evolutionary relationships.</p>\n\n</div>",
        "doi": "10.1093/gbe/evae120",
        "pmcid": "PMC11214158",
        "issn": "1759-6653",
        "publisher": "Oxford University Press",
        "publication": "Genome Biology and Evolution",
        "publication_date": "2024-06",
        "series_number": "6",
        "volume": "16",
        "issue": "6",
        "pages": "evae120"
    },
    {
        "id": "authors:q7w89-9xr46",
        "collection": "authors",
        "collection_id": "q7w89-9xr46",
        "cite_using_url": "https://authors.library.caltech.edu/records/q7w89-9xr46",
        "type": "article",
        "title": "A machine-readable specification for genomics assays",
        "author": [
            {
                "family_name": "Booeshaghi",
                "given_name": "Ali Sina",
                "orcid": "0000-0002-6442-4502"
            },
            {
                "family_name": "Chen",
                "given_name": "Xi",
                "orcid": "0000-0003-2648-3146"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "<div class=\" sec\">\n<div class=\"title\">Motivation</div>\n<p class=\"chapter-para\">Understanding the structure of sequenced fragments from genomics libraries is essential for accurate read preprocessing. Currently, different assays and sequencing technologies require custom scripts and programs that do not leverage the common structure of sequence elements present in genomics libraries.</p>\n</div>\n<div class=\" sec\">\n<div class=\"title\">Results</div>\n<p class=\"chapter-para\">We present&nbsp;<em>seqspec</em>, a machine-readable specification for libraries produced by genomics assays that facilitates standardization of preprocessing and enables tracking and comparison of genomics assays.</p>\n</div>\n<div class=\" sec\">\n<div class=\"title\">Availability and implementation</div>\n<p class=\"chapter-para\">The specification and associated&nbsp;<em>seqspec</em>&nbsp;command line tool is available at&nbsp;<a class=\"link link-uri openInAnotherWindow\" href=\"https://www.doi.org/10.5281/zenodo.10213865\" rel=\"noopener\">https://www.doi.org/10.5281/zenodo.10213865</a>.</p>\n</div>",
        "doi": "10.1093/bioinformatics/btae168",
        "pmcid": "PMC11009023",
        "issn": "1367-4811",
        "publisher": "Oxford University Press",
        "publication": "Bioinformatics",
        "publication_date": "2024-04",
        "series_number": "4",
        "volume": "40",
        "issue": "4",
        "pages": "btae168"
    },
    {
        "id": "authors:y0p8m-yqm72",
        "collection": "authors",
        "collection_id": "y0p8m-yqm72",
        "cite_using_url": "https://authors.library.caltech.edu/records/y0p8m-yqm72",
        "type": "article",
        "title": "Fast and scalable querying of eukaryotic linear motifs with gget elm",
        "author": [
            {
                "family_name": "Luebbert",
                "given_name": "Laura",
                "orcid": "0000-0003-1379-2927",
                "clpid": "Luebbert-Laura"
            },
            {
                "family_name": "Hoang",
                "given_name": "Chi",
                "orcid": "0000-0003-0068-4898",
                "clpid": "Hoang-Chi"
            },
            {
                "family_name": "Kumar",
                "given_name": "Manjeet",
                "orcid": "0000-0002-3004-2151",
                "clpid": "Kumar-Manjeet"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "<div class=\"sec sec-first\">\n<h3>Motivation</h3>\n<p class=\"p p-first-last\">Eukaryotic linear motifs (ELMs), or Short Linear Motifs, are protein interaction modules that play an essential role in cellular processes and signaling networks and are often involved in diseases like cancer. The ELM database is a collection of manually curated motif knowledge from scientific papers. It has become a crucial resource for investigating motif biology and recognizing candidate ELMs in novel amino acid sequences. Users can search amino acid sequences or UniProt Accessions on the ELM resource web interface. However, as with many web services, there are limitations in the swift processing of large-scale queries through the ELM web interface or API calls, and, therefore, integration into protein function analysis pipelines is limited.</p>\n</div>\n<div class=\"sec\">\n<h3>Results</h3>\n<p class=\"p p-first-last\">To allow swift, large-scale motif analyses on protein sequences using ELMs curated in the ELM database, we have extended the&nbsp;<em>gget</em>&nbsp;suite of Python and command line tools with a new module,&nbsp;<em>gget elm</em>, which does not rely on the ELM server for efficiently finding candidate ELMs in user-submitted amino acid sequences and UniProt Accessions<em>. gget elm</em>&nbsp;increases accessibility to the information stored in the ELM database and allows scalable searches for motif-mediated interaction sites in the amino acid sequences.</p>\n</div>\n<div class=\"sec sec-last\">\n<h3>Availability and implementation</h3>\n<p class=\"p p-first-last\">The manual and source code are available at&nbsp;<a href=\"https://github.com/pachterlab/gget\" rel=\"noopener\">https://github.com/pachterlab/gget</a>.</p>\n</div>",
        "doi": "10.1093/bioinformatics/btae095",
        "pmcid": "PMC10927331",
        "issn": "1367-4811",
        "publisher": "Oxford University Press",
        "publication": "Bioinformatics",
        "publication_date": "2024-03",
        "series_number": "3",
        "volume": "40",
        "issue": "3",
        "pages": "btae095"
    },
    {
        "id": "authors:4bdzc-r3y87",
        "collection": "authors",
        "collection_id": "4bdzc-r3y87",
        "cite_using_url": "https://authors.library.caltech.edu/records/4bdzc-r3y87",
        "type": "article",
        "title": "New and notable: Revisiting the \"two cultures\" through extrinsic noise",
        "author": [
            {
                "family_name": "Gorin",
                "given_name": "Gennady",
                "orcid": "0000-0001-6097-2029",
                "clpid": "Gorin-Gennady"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "<p>In a classic article (<a class=\"anchor u-display-inline anchor-paragraph\" href=\"https://www.sciencedirect.com/science/article/pii/S0006349523041164?via%3Dihub#bib1\"><span class=\"anchor-text\">1</span></a>), Leo Breiman bears witness to the divergence between &ldquo;two cultures&rdquo; of statistics that emerged in the wake of readily accessible computing technology: the data modeling culture, which concerns itself with developing and fitting stochastic models, and the algorithmic modeling culture, which concerns itself with improving predictive accuracy without delving into unknown (and perhaps unknowable) mechanisms. More than two decades later, the distinct cultures of statistics are evident in approaches to single-molecule transcriptomics. The biophysics subfield focuses on assays that target a small number of genes and develops increasingly sophisticated mechanistic models, whereas the sequence census subfield uses descriptive, data-scientific methods such as those championed by Breiman.</p>",
        "doi": "10.1016/j.bpj.2023.11.3400",
        "issn": "0006-3495",
        "publisher": "Cell Press",
        "publication": "Biophysical Journal",
        "publication_date": "2024-01-02",
        "series_number": "1",
        "volume": "123",
        "issue": "1",
        "pages": "1-3"
    },
    {
        "id": "authors:axtf4-cb576",
        "collection": "authors",
        "collection_id": "axtf4-cb576",
        "cite_using_url": "https://authors.library.caltech.edu/records/axtf4-cb576",
        "type": "article",
        "title": "Quantifying orthogonal barcodes for sequence census assays",
        "author": [
            {
                "family_name": "Booeshaghi",
                "given_name": "A. Sina",
                "orcid": "0000-0002-6442-4502",
                "clpid": "Booeshaghi-A-Sina"
            },
            {
                "family_name": "Min",
                "given_name": "Kyung Hoi (Joseph)",
                "orcid": "0000-0003-0894-4017"
            },
            {
                "family_name": "Gehring",
                "given_name": "Jase",
                "orcid": "0000-0002-3894-9495",
                "clpid": "Gehring-Jase"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "<div class=\" sec\">\n<p class=\"chapter-para\">Barcode-based sequence census assays utilize custom or random oligonucloetide sequences to label various biological features, such as cell-surface proteins or CRISPR perturbations. These assays all rely on barcode quantification, a task that is complicated by barcode design and technical noise. We introduce a modular approach to quantifying barcodes that achieves speed and memory improvements over existing tools. We also introduce a set of quality control metrics, and accompanying tool, for validating barcode designs.</p>\n</div>",
        "doi": "10.1093/bioadv/vbad181",
        "pmcid": "PMC10783946",
        "issn": "2635-0041",
        "publisher": "Oxford University Press",
        "publication": "Bioinformatics Advances",
        "publication_date": "2024",
        "volume": "4",
        "pages": "vbad181"
    },
    {
        "id": "authors:ntw84-7bx60",
        "collection": "authors",
        "collection_id": "ntw84-7bx60",
        "cite_using_url": "https://authors.library.caltech.edu/records/ntw84-7bx60",
        "type": "article",
        "title": "Direct androgen receptor control of sexually dimorphic gene expression in the mammalian kidney",
        "author": [
            {
                "family_name": "Xiong",
                "given_name": "Lingyun",
                "orcid": "0000-0003-4594-4120",
                "clpid": "Xiong-Lingyun"
            },
            {
                "family_name": "Liu",
                "given_name": "Jing",
                "clpid": "Liu-Jing"
            },
            {
                "family_name": "Han",
                "given_name": "Seung Yub",
                "clpid": "Han-Seung-Yub"
            },
            {
                "family_name": "Koppitch",
                "given_name": "Kari",
                "clpid": "Koppitch-Kari"
            },
            {
                "family_name": "Guo",
                "given_name": "Jin-Jin",
                "clpid": "Guo-Jin-Jin"
            },
            {
                "family_name": "Rommelfanger",
                "given_name": "Megan",
                "orcid": "0000-0003-3071-7419",
                "clpid": "Rommelfanger-Megan-K"
            },
            {
                "family_name": "Miao",
                "given_name": "Zhen",
                "orcid": "0000-0002-3255-9517",
                "clpid": "Miao-Zhen"
            },
            {
                "family_name": "Gao",
                "given_name": "Fan",
                "orcid": "0000-0001-6832-3402",
                "clpid": "Gao-Fan"
            },
            {
                "family_name": "Hallgrimsdottir",
                "given_name": "Ingileif B.",
                "orcid": "0000-0002-4710-0047",
                "clpid": "Hallgrimsdottir-Ingileif-B"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior S.",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "family_name": "Kim",
                "given_name": "Junhyong",
                "orcid": "0000-0002-7726-8246",
                "clpid": "Kim-Junhyong"
            },
            {
                "family_name": "MacLean",
                "given_name": "Adam L.",
                "orcid": "0000-0003-0689-7907",
                "clpid": "MacLean-Adam-L"
            },
            {
                "family_name": "McMahon",
                "given_name": "Andrew P.",
                "orcid": "0000-0002-3779-1729",
                "clpid": "McMahon-Andrew-P"
            }
        ],
        "abstract": "<p>Mammalian organs exhibit distinct physiology, disease susceptibility, and injury responses between the sexes. In the mouse kidney, sexually dimorphic gene activity maps predominantly to proximal tubule (PT) segments. Bulk <a href=\"https://www.sciencedirect.com/topics/biochemistry-genetics-and-molecular-biology/rna-sequence\">RNA sequencing</a> (RNA-seq) data demonstrated that sex differences were established from 4 and 8&nbsp;weeks after birth under gonadal control. Hormone injection studies and <a href=\"https://www.sciencedirect.com/topics/biochemistry-genetics-and-molecular-biology/genetics\">genetic</a> removal of androgen and estrogen receptors demonstrated androgen receptor (AR)-mediated regulation of gene activity in PT cells as the regulatory mechanism. Interestingly, caloric restriction feminizes the male kidney. Single-nuclear multiomic analysis identified putative <i>cis</i>-regulatory regions and cooperating factors mediating PT responses to AR activity in the mouse kidney. In the human kidney, a limited set of genes showed conserved sex-linked regulation, whereas analysis of the mouse liver underscored organ-specific differences in the regulation of sexually dimorphic gene expression. These findings raise interesting questions on the evolution, physiological significance, disease, and metabolic linkage of sexually dimorphic gene activity.</p>",
        "doi": "10.1016/j.devcel.2023.08.010",
        "pmcid": "PMC10873092",
        "issn": "1878-1551",
        "publisher": "Cell Press",
        "publication": "Developmental Cell",
        "publication_date": "2023-11-06",
        "series_number": "21",
        "volume": "58",
        "issue": "21",
        "pages": "2338-2358.e5"
    },
    {
        "id": "authors:5z5v2-jjy66",
        "collection": "authors",
        "collection_id": "5z5v2-jjy66",
        "cite_using_url": "https://authors.library.caltech.edu/records/5z5v2-jjy66",
        "type": "article",
        "title": "Assessing Markovian and Delay Models for Single-Nucleus RNA Sequencing",
        "author": [
            {
                "family_name": "Gorin",
                "given_name": "Gennady",
                "orcid": "0000-0001-6097-2029",
                "clpid": "Gorin-Gennady"
            },
            {
                "family_name": "Yoshida",
                "given_name": "Shawn",
                "orcid": "0000-0002-0866-2741",
                "clpid": "Yoshida-Shawn-R"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "<p>The serial nature of reactions involved in the RNA life-cycle motivates the incorporation of delays in models of transcriptional dynamics. The models couple a transcriptional process to a fairly general set of delayed monomolecular reactions with no feedback. We provide numerical strategies for calculating the RNA copy number distributions induced by these models, and solve several systems with splicing, degradation, and catalysis. An analysis of single-cell and single-nucleus RNA sequencing data using these models reveals that the kinetics of nuclear export do not appear to require invocation of a non-Markovian waiting time.</p>",
        "doi": "10.1007/s11538-023-01213-9",
        "issn": "0092-8240",
        "publisher": "Springer Nature",
        "publication": "Bulletin of Mathematical Biology",
        "publication_date": "2023-11",
        "series_number": "11",
        "volume": "85",
        "issue": "11",
        "pages": "114"
    },
    {
        "id": "authors:x9gbd-0gk44",
        "collection": "authors",
        "collection_id": "x9gbd-0gk44",
        "cite_using_url": "https://authors.library.caltech.edu/records/x9gbd-0gk44",
        "type": "article",
        "title": "Studying stochastic systems biology of the cell with single-cell genomics data",
        "author": [
            {
                "family_name": "Gorin",
                "given_name": "Gennady",
                "orcid": "0000-0001-6097-2029",
                "clpid": "Gorin-Gennady"
            },
            {
                "family_name": "Vastola",
                "given_name": "John J.",
                "orcid": "0000-0002-5625-2106",
                "clpid": "Vastola-John-J"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "<p>Recent experimental developments in genome-wide <a href=\"https://www.sciencedirect.com/topics/biochemistry-genetics-and-molecular-biology/rna\">RNA</a> quantification hold considerable promise for systems biology. However, rigorously probing the biology of living cells requires a unified mathematical framework that accounts for single-molecule biological stochasticity in the context of technical variation associated with genomics assays. We review models for a variety of <a href=\"https://www.sciencedirect.com/topics/biochemistry-genetics-and-molecular-biology/rna-transcription\">RNA transcription</a> processes, as well as the encapsulation and library construction steps of microfluidics-based single-cell <a href=\"https://www.sciencedirect.com/topics/biochemistry-genetics-and-molecular-biology/rna-sequence\">RNA sequencing</a>, and present a framework to integrate these phenomena by the manipulation of generating functions. Finally, we use simulated scenarios and biological data to illustrate the implications and applications of the approach.</p>",
        "doi": "10.1016/j.cels.2023.08.004",
        "pmcid": "PMC10725240",
        "issn": "2405-4712",
        "publisher": "Cell Press",
        "publication": "Cell Systems",
        "publication_date": "2023-10-18",
        "series_number": "10",
        "volume": "14",
        "issue": "10",
        "pages": "822-843.e22"
    },
    {
        "id": "authors:hs9mc-jb762",
        "collection": "authors",
        "collection_id": "hs9mc-jb762",
        "cite_using_url": "https://authors.library.caltech.edu/records/hs9mc-jb762",
        "type": "article",
        "title": "Author Correction: Principles of open source bioinstrumentation applied to the poseidon syringe pump system",
        "author": [
            {
                "family_name": "Booeshaghi",
                "given_name": "A. Sina",
                "orcid": "0000-0002-6442-4502",
                "clpid": "Booeshaghi-A-Sina"
            },
            {
                "family_name": "Beltrame",
                "given_name": "Eduardo da Veiga",
                "orcid": "0000-0002-1529-9207",
                "clpid": "Beltrame-Eduardo-da-Veiga"
            },
            {
                "family_name": "Bannon",
                "given_name": "Dylan",
                "clpid": "Bannon-Dylan"
            },
            {
                "family_name": "Gehring",
                "given_name": "Jase",
                "orcid": "0000-0002-3894-9495",
                "clpid": "Gehring-Jase"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "<p>Correction to: <i>Scientific Reports</i> <a href=\"https://doi.org/10.1038/s41598-019-48815-9\">https://doi.org/10.1038/s41598-019-48815-9</a>, published online 27 August 2019</p><p>This Article contains an error in Figure&nbsp;4, where the replotting of a subset of data in Figure&nbsp;4a, which pertain to the Harvard dataset is incorrect in panels (1) and (3). The correct Figure&nbsp;<a href=\"https://www.nature.com/articles/s41598-023-42035-y#Fig4\">4</a> and accompanying legend appear below.</p>",
        "doi": "10.1038/s41598-023-42035-y",
        "pmcid": "PMC10491597",
        "issn": "2045-2322",
        "publisher": "Nature Publishing Group",
        "publication": "Scientific Reports",
        "publication_date": "2023-09-08",
        "volume": "13",
        "pages": "14834"
    },
    {
        "id": "authors:ewrjt-pbk58",
        "collection": "authors",
        "collection_id": "ewrjt-pbk58",
        "cite_using_url": "https://authors.library.caltech.edu/records/ewrjt-pbk58",
        "type": "article",
        "title": "A standard for sharing spatial transcriptomics data",
        "author": [
            {
                "family_name": "Jackson",
                "given_name": "Kayla C.",
                "orcid": "0000-0001-6483-0108",
                "clpid": "Jackson-Kayla-C"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "<p>Spatial transcriptomic technologies have the potential to reveal critical relationships between the function of genes and cells and their spatial organization. Here, we provide a sharing model for spatial transcriptomics data with the aim of establishing a set of primary data and metadata needed to reproduce analyses and facilitate computational methods development.</p>",
        "doi": "10.1016/j.xgen.2023.100374",
        "pmcid": "PMC10435375",
        "issn": "2666-979X",
        "publisher": "Cell Press",
        "publication": "Cell Genomics",
        "publication_date": "2023-08-09",
        "series_number": "8",
        "volume": "3",
        "issue": "8",
        "pages": "100374"
    },
    {
        "id": "authors:fzh9v-hjh15",
        "collection": "authors",
        "collection_id": "fzh9v-hjh15",
        "cite_using_url": "https://authors.library.caltech.edu/records/fzh9v-hjh15",
        "type": "article",
        "title": "The specious art of single-cell genomics",
        "author": [
            {
                "family_name": "Chari",
                "given_name": "Tara",
                "orcid": "0000-0002-6953-4313",
                "clpid": "Chari-Tara"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "editor": [
            {
                "family_name": "Papin",
                "given_name": "Jason A."
            }
        ],
        "abstract": "Dimensionality reduction is standard practice for filtering noise and identifying relevant features in large-scale data analyses. In biology, single-cell genomics studies typically begin with reduction to 2 or 3 dimensions to produce \"all-in-one\" visuals of the data that are amenable to the human eye, and these are subsequently used for qualitative and quantitative exploratory analysis. However, there is little theoretical support for this practice, and we show that extreme dimension reduction, from hundreds or thousands of dimensions to 2, inevitably induces significant distortion of high-dimensional datasets. We therefore examine the practical implications of low-dimensional embedding of single-cell data and find that extensive distortions and inconsistent practices make such embeddings counter-productive for exploratory, biological analyses. In lieu of this, we discuss alternative approaches for conducting targeted embedding and feature exploration to enable hypothesis-driven biological discovery.",
        "doi": "10.1371/journal.pcbi.1011288",
        "pmcid": "PMC10434946",
        "issn": "1553-7358",
        "publisher": "Public Library of Science",
        "publication": "PLOS Computational Biology",
        "publication_date": "2023-08",
        "series_number": "8",
        "volume": "19",
        "issue": "8",
        "pages": "e1011288"
    },
    {
        "id": "authors:nvj6h-bzw14",
        "collection": "authors",
        "collection_id": "nvj6h-bzw14",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20230502-708586900.1",
        "type": "article",
        "title": "Cell-specific occupancy dynamics between the pioneer-like factor Opa/ZIC and Ocelliless/OTX regulate early head development in embryos",
        "author": [
            {
                "family_name": "Fenelon",
                "given_name": "Kelli D.",
                "orcid": "0000-0002-1294-9200",
                "clpid": "Fenelon-Kelli-D"
            },
            {
                "family_name": "Gao",
                "given_name": "Fan",
                "orcid": "0000-0001-6832-3402",
                "clpid": "Gao-Fan"
            },
            {
                "family_name": "Borad",
                "given_name": "Priyanshi",
                "orcid": "0000-0001-8446-5312",
                "clpid": "Borad-Priyanshi"
            },
            {
                "family_name": "Abbasi",
                "given_name": "Shiva",
                "orcid": "0000-0002-6470-335X",
                "clpid": "Abbasi-Shiva"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "family_name": "Koromila",
                "given_name": "Theodora",
                "orcid": "0000-0001-5504-1369",
                "clpid": "Koromila-Theodora"
            }
        ],
        "abstract": "During development, embryonic patterning systems direct a set of initially uncommitted pluripotent cells to differentiate into a variety of cell types and tissues. A core network of transcription factors, such as Zelda/POU5F1, Odd-paired (Opa)/ZIC3 and Ocelliless (Oc)/OTX2, are conserved across animals. While Opa is essential for a second wave of zygotic activation after Zelda, it is unclear whether Opa drives head cell specification, in the Drosophila embryo. Our hypothesis is that Opa and Oc are interacting with distinct cis-regulatory regions for shaping cell fates in the embryonic head. Super-resolution microscopy and meta-analysis of single-cell RNAseq datasets show that opa's and oc's overlapping expression domains are dynamic in the head region, with both factors being simultaneously transcribed at the blastula stage. Additionally, analysis of single-embryo RNAseq data reveals a subgroup of Opa-bound genes to be Opa-independent in the cellularized embryo. Interrogation of these genes against Oc ChIPseq combined with in situ data, suggests that Opa is competing with Oc for the regulation of a subgroup of genes later in gastrulation. Specifically, we find that Oc binds to late, head-specific enhancers independently and activates them in a head-specific wave of zygotic transcription, suggesting distinct roles for Oc in the blastula and gastrula stages.",
        "doi": "10.3389/fcell.2023.1126507",
        "pmcid": "PMC10083704",
        "issn": "2296-634X",
        "publisher": "Frontiers Media",
        "publication": "Frontiers in Cell and Developmental Biology",
        "publication_date": "2023-03-27",
        "volume": "11",
        "pages": "Art. No. 1126507"
    },
    {
        "id": "authors:nz49t-npq98",
        "collection": "authors",
        "collection_id": "nz49t-npq98",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20230725-706344000.34",
        "type": "article",
        "title": "Efficient querying of genomic reference databases with gget",
        "author": [
            {
                "family_name": "Luebbert",
                "given_name": "Laura",
                "orcid": "0000-0003-1379-2927",
                "clpid": "Luebbert-Laura"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "Motivation: A recurring challenge in interpreting genomic data is the assessment of results in the context of existing reference databases. With the increasing number of command line and Python users, there is a need for tools implementing automated, easy programmatic access to curated reference information stored in a diverse collection of large, public genomic databases. \n\nResults: gget is a free and open-source command line tool and Python package that enables efficient querying of genomic reference databases, such as Ensembl. gget consists of a collection of separate but interoperable modules, each designed to facilitate one type of database querying required for genomic data analysis in a single line of code.",
        "doi": "10.1093/bioinformatics/btac836",
        "pmcid": "PMC9835474",
        "issn": "1367-4811",
        "publisher": "Oxford University Press",
        "publication": "Bioinformatics",
        "publication_date": "2023-01-01",
        "series_number": "1",
        "volume": "39",
        "issue": "1",
        "pages": "Art. No. btac836"
    },
    {
        "id": "authors:ryza3-50d52",
        "collection": "authors",
        "collection_id": "ryza3-50d52",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20230411-694477200.2",
        "type": "article",
        "title": "Metadata retrieval from sequence databases with ffq",
        "author": [
            {
                "family_name": "G\u00e1lvez-Merch\u00e1n",
                "given_name": "\u00c1ngel",
                "orcid": "0000-0001-7420-8697",
                "clpid": "G\u00e1lvez-Merch\u00e1n-\u00c1ngel"
            },
            {
                "family_name": "Min",
                "given_name": "Kyung Hoi (Joseph)",
                "orcid": "0000-0003-0894-4017",
                "clpid": "Min-Kyung-Hoi-Joseph"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "family_name": "Booeshaghi",
                "given_name": "A. Sina",
                "orcid": "0000-0002-6442-4502",
                "clpid": "Booeshaghi-A-Sina"
            }
        ],
        "abstract": "Motivation: Several genomic databases host data and metadata for an ever-growing collection of sequence datasets. While these databases have a shared hierarchical structure, there are no tools specifically designed to leverage it for metadata extraction.\n\nResults: We present a command-line tool, called ffq, for querying user-generated data and metadata from sequence databases. Given an accession or a paper's DOI, ffq efficiently fetches metadata and links to raw data in JSON format. ffq's modularity and simplicity make it extensible to any genomic database exposing its data for programmatic access.\n\nAvailability and implementation: ffq is free and open source, and the code can be found here: https://github.com/pachterlab/ffq.",
        "doi": "10.1093/bioinformatics/btac667",
        "pmcid": "PMC9883619",
        "issn": "1367-4811",
        "publisher": "Oxford University Press",
        "publication": "Bioinformatics",
        "publication_date": "2023-01",
        "series_number": "1",
        "volume": "39",
        "issue": "1",
        "pages": "Art. No. btac667"
    },
    {
        "id": "authors:d6ppt-egs07",
        "collection": "authors",
        "collection_id": "d6ppt-egs07",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20230622-883274000.1",
        "type": "article",
        "title": "Interpretable and tractable models of transcriptional noise for the rational design of single-molecule quantification experiments",
        "author": [
            {
                "family_name": "Gorin",
                "given_name": "Gennady",
                "orcid": "0000-0001-6097-2029",
                "clpid": "Gorin-Gennady"
            },
            {
                "family_name": "Vastola",
                "given_name": "John J.",
                "orcid": "0000-0002-5625-2106",
                "clpid": "Vastola-John-J"
            },
            {
                "family_name": "Fang",
                "given_name": "Meichen",
                "orcid": "0000-0002-8217-0710",
                "clpid": "Fang-Meichen"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "The question of how cell-to-cell differences in transcription rate affect RNA count distributions is fundamental for understanding biological processes underlying transcription. Answering this question requires quantitative models that are both interpretable (describing concrete biophysical phenomena) and tractable (amenable to mathematical analysis). This enables the identification of experiments which best discriminate between competing hypotheses. As a proof of principle, we introduce a simple but flexible class of models involving a continuous stochastic transcription rate driving a discrete RNA transcription and splicing process, and compare and contrast two biologically plausible hypotheses about transcription rate variation. One assumes variation is due to DNA experiencing mechanical strain, while the other assumes it is due to regulator number fluctuations. We introduce a framework for numerically and analytically studying such models, and apply Bayesian model selection to identify candidate genes that show signatures of each model in single-cell transcriptomic data from mouse glutamatergic neurons.",
        "doi": "10.1038/s41467-022-34857-7",
        "pmcid": "PMC9734650",
        "issn": "2041-1723",
        "publisher": "Nature Publishing Group",
        "publication": "Nature Communications",
        "publication_date": "2022-12-09",
        "volume": "13",
        "pages": "Art. No. 7620"
    },
    {
        "id": "authors:sgfnt-bab08",
        "collection": "authors",
        "collection_id": "sgfnt-bab08",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20220916-665804000.785",
        "type": "article",
        "title": "RNA velocity unraveled",
        "author": [
            {
                "family_name": "Gorin",
                "given_name": "Gennady",
                "orcid": "0000-0001-6097-2029",
                "clpid": "Gorin-Gennady"
            },
            {
                "family_name": "Fang",
                "given_name": "Meichen",
                "orcid": "0000-0002-8217-0710",
                "clpid": "Fang-Meichen"
            },
            {
                "family_name": "Chari",
                "given_name": "Tara",
                "orcid": "0000-0002-6953-4313",
                "clpid": "Chari-Tara"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "We perform a thorough analysis of RNA velocity methods, with a view towards understanding the suitability of the various assumptions underlying popular implementations. In addition to providing a self-contained exposition of the underlying mathematics, we undertake simulations and perform controlled experiments on biological datasets to assess workflow sensitivity to parameter choices and underlying biology. Finally, we argue for a more rigorous approach to RNA velocity, and present a framework for Markovian analysis that points to directions for improvement and mitigation of current problems.",
        "doi": "10.1371/journal.pcbi.1010492",
        "pmcid": "PMC9499228",
        "issn": "1553-7358",
        "publisher": "Public Library of Science",
        "publication": "PLOS Computational Biology",
        "publication_date": "2022-09",
        "series_number": "9",
        "volume": "18",
        "issue": "9",
        "pages": "Art. No. e1010492"
    },
    {
        "id": "authors:m4wef-4m072",
        "collection": "authors",
        "collection_id": "m4wef-4m072",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20210513-122736659",
        "type": "article",
        "title": "Museum of spatial transcriptomics",
        "author": [
            {
                "family_name": "Moses",
                "given_name": "Lambda",
                "orcid": "0000-0002-7092-9427",
                "clpid": "Moses-Lambda"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "The function of many biological systems, such as embryos, liver lobules, intestinal villi, and tumors, depends on the spatial organization of their cells. In the past decade, high-throughput technologies have been developed to quantify gene expression in space, and computational methods have been developed that leverage spatial gene expression data to identify genes with spatial patterns and to delineate neighborhoods within tissues. To comprehensively document spatial gene expression technologies and data-analysis methods, we present a curated review of literature on spatial transcriptomics dating back to 1987, along with a thorough analysis of trends in the field, such as usage of experimental techniques, species, tissues studied, and computational approaches used. Our Review places current methods in a historical context, and we derive insights about the field that can guide current research strategies. A companion supplement offers a more detailed look at the technologies and methods analyzed: https://pachterlab.github.io/LP_2021/.",
        "doi": "10.1038/s41592-022-01409-2",
        "issn": "1548-7091",
        "publisher": "Nature Publishing Group",
        "publication": "Nature Methods",
        "publication_date": "2022-05",
        "series_number": "5",
        "volume": "19",
        "issue": "5",
        "pages": "534-546"
    },
    {
        "id": "authors:myke3-zsv09",
        "collection": "authors",
        "collection_id": "myke3-zsv09",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20210325-075042340",
        "type": "article",
        "title": "Modeling bursty transcription and splicing with the chemical master equation",
        "author": [
            {
                "family_name": "Gorin",
                "given_name": "Gennady",
                "orcid": "0000-0001-6097-2029",
                "clpid": "Gorin-Gennady"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "Splicing cascades that alter gene products posttranscriptionally also affect expression dynamics. We study a class of processes and associated distributions that emerge from models of bursty promoters coupled to directed acyclic graphs of splicing. These solutions provide full time-dependent joint distributions for an arbitrary number of species with general noise behaviors and transient phenomena, offering qualitative and quantitative insights about how splicing can regulate expression dynamics. Finally, we derive a set of quantitative constraints on the minimum complexity necessary to reproduce gene coexpression patterns using synchronized burst models. We validate these findings by analyzing long-read sequencing data, where we find evidence of expression patterns largely consistent with these constraints.",
        "doi": "10.1016/j.bpj.2022.02.004",
        "issn": "0006-3495",
        "publisher": "Cell Press",
        "publication": "Biophysical Journal",
        "publication_date": "2022-03-15",
        "series_number": "6",
        "volume": "121",
        "issue": "6",
        "pages": "1056-1069"
    },
    {
        "id": "authors:jy3nx-j5b58",
        "collection": "authors",
        "collection_id": "jy3nx-j5b58",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20210503-142332959",
        "type": "article",
        "title": "A Python library for probabilistic analysis of single-cell omics data",
        "author": [
            {
                "family_name": "Gayoso",
                "given_name": "Adam",
                "orcid": "0000-0001-9537-0845",
                "clpid": "Gayoso-Adam"
            },
            {
                "family_name": "Lopez",
                "given_name": "Romain",
                "orcid": "0000-0003-0495-738X",
                "clpid": "Lopez-Romain"
            },
            {
                "family_name": "Xing",
                "given_name": "Galen",
                "orcid": "0000-0001-7376-6312",
                "clpid": "Xing-Galen"
            },
            {
                "family_name": "Boyeau",
                "given_name": "Pierre",
                "orcid": "0000-0003-4549-3972",
                "clpid": "Boyeau-Pierre"
            },
            {
                "family_name": "Amiri",
                "given_name": "Valeh Valiollah Pour",
                "orcid": "0000-0002-2008-5297",
                "clpid": "Amiri-Valeh-Valiollah-Pour"
            },
            {
                "family_name": "Hong",
                "given_name": "Justin",
                "orcid": "0000-0003-2115-9101",
                "clpid": "Hong-Justin"
            },
            {
                "family_name": "Wu",
                "given_name": "Katherine",
                "orcid": "0000-0001-7562-4545",
                "clpid": "Wu-Katherine"
            },
            {
                "family_name": "Jayasuriya",
                "given_name": "Michael",
                "orcid": "0000-0003-2366-841X",
                "clpid": "Jayasuriya-Michael"
            },
            {
                "family_name": "Mehlman",
                "given_name": "Edouard",
                "orcid": "0000-0001-6351-2220",
                "clpid": "Mehlman-Edouard"
            },
            {
                "family_name": "Langevin",
                "given_name": "Maxime",
                "orcid": "0000-0002-5498-4661",
                "clpid": "Langevin-Maxime"
            },
            {
                "family_name": "Liu",
                "given_name": "Yining",
                "orcid": "0000-0002-8779-2906",
                "clpid": "Liu-Yining"
            },
            {
                "family_name": "Samaran",
                "given_name": "Jules",
                "orcid": "0000-0001-7317-8190",
                "clpid": "Samaran-Jules"
            },
            {
                "family_name": "Misrachi",
                "given_name": "Gabriel",
                "orcid": "0000-0002-6020-4641",
                "clpid": "Misrachi-Gabriel"
            },
            {
                "family_name": "Nazaret",
                "given_name": "Achille",
                "orcid": "0000-0002-5428-9810",
                "clpid": "Nazaret-Achille"
            },
            {
                "family_name": "Clivio",
                "given_name": "Oscar",
                "orcid": "0000-0001-8668-4535",
                "clpid": "Clivio-Oscar"
            },
            {
                "family_name": "Xu",
                "given_name": "Chenling",
                "orcid": "0000-0001-9610-7627",
                "clpid": "Xu-Chenling"
            },
            {
                "family_name": "Ashuach",
                "given_name": "Tal",
                "orcid": "0000-0003-1939-0865",
                "clpid": "Ashuach-Tal"
            },
            {
                "family_name": "Gabitto",
                "given_name": "Mariano",
                "orcid": "0000-0001-6911-344X",
                "clpid": "Gabitto-Mariano"
            },
            {
                "family_name": "Lotfollahi",
                "given_name": "Mohammad",
                "orcid": "0000-0001-6858-7985",
                "clpid": "Lotfollahi-Mohammad"
            },
            {
                "family_name": "Svensson",
                "given_name": "Valentine",
                "orcid": "0000-0002-9217-2330",
                "clpid": "Svensson-Valentine"
            },
            {
                "family_name": "da Veiga Beltrame",
                "given_name": "Eduardo",
                "orcid": "0000-0002-1529-9207",
                "clpid": "da-Veiga-Beltrame-Eduardo"
            },
            {
                "family_name": "Kleshchevnikov",
                "given_name": "Vitalii",
                "orcid": "0000-0001-9110-7441",
                "clpid": "Kleshchevnikov-Vitalii"
            },
            {
                "family_name": "Talavera-L\u00f3pez",
                "given_name": "Carlos",
                "orcid": "0000-0001-8590-2393",
                "clpid": "Talavera-L\u00f3pez-Carlos"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "family_name": "Theis",
                "given_name": "Fabian J.",
                "orcid": "0000-0002-2419-1943",
                "clpid": "Theis-Fabian-J"
            },
            {
                "family_name": "Streets",
                "given_name": "Aaron",
                "orcid": "0000-0002-3909-8389",
                "clpid": "Streets-Aaron-M"
            },
            {
                "family_name": "Jordan",
                "given_name": "Michael I.",
                "orcid": "0000-0001-8935-817X",
                "clpid": "Jordan-Michael-I"
            },
            {
                "family_name": "Regier",
                "given_name": "Jeffrey",
                "orcid": "0000-0002-1472-5235",
                "clpid": "Regier-Jeffrey"
            },
            {
                "family_name": "Yosef",
                "given_name": "Nir",
                "orcid": "0000-0001-9004-1225",
                "clpid": "Yosef-Nir"
            }
        ],
        "abstract": "Methods for analyzing single-cell data perform a core set of computational tasks. These tasks include dimensionality reduction, cell clustering, cell-state annotation, removal of unwanted variation, analysis of differential expression, identification of spatial patterns of gene expression, and joint analysis of multi-modal omics data. Many of these methods rely on likelihood-based models to represent variation in the data; we refer to these as 'probabilistic models'. Probabilistic models provide principled ways to capture uncertainty in biological systems and are convenient for decomposing the many sources of variation that give rise to omics data.",
        "doi": "10.1038/s41587-021-01206-w",
        "issn": "1087-0156",
        "publisher": "Nature Publishing Group",
        "publication": "Nature Biotechnology",
        "publication_date": "2022-02",
        "series_number": "2",
        "volume": "40",
        "issue": "2",
        "pages": "163-166"
    },
    {
        "id": "authors:5afgq-qgg51",
        "collection": "authors",
        "collection_id": "5afgq-qgg51",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20210126-133110736",
        "type": "article",
        "title": "Whole-animal multiplexed single-cell RNA-seq reveals transcriptional shifts across Clytia medusa cell types",
        "author": [
            {
                "family_name": "Chari",
                "given_name": "Tara",
                "orcid": "0000-0002-6953-4313",
                "clpid": "Chari-Tara"
            },
            {
                "family_name": "Weissbourd",
                "given_name": "Brandon",
                "orcid": "0000-0001-5422-3873",
                "clpid": "Weissbourd-Brandon"
            },
            {
                "family_name": "Gehring",
                "given_name": "Jase",
                "orcid": "0000-0002-3894-9495",
                "clpid": "Gehring-Jase"
            },
            {
                "family_name": "Ferraioli",
                "given_name": "Anna",
                "orcid": "0000-0003-1817-6891",
                "clpid": "Ferraioli-Anna"
            },
            {
                "family_name": "Lecl\u00e8re",
                "given_name": "Lucas",
                "orcid": "0000-0002-7440-0467",
                "clpid": "Lecl\u00e8re-Lucas"
            },
            {
                "family_name": "Herl",
                "given_name": "Makenna",
                "orcid": "0000-0001-8518-5179",
                "clpid": "Herl-Makenna"
            },
            {
                "family_name": "Gao",
                "given_name": "Fan",
                "orcid": "0000-0001-6832-3402",
                "clpid": "Gao-Fan"
            },
            {
                "family_name": "Chevalier",
                "given_name": "Sandra",
                "orcid": "0000-0002-2717-6925",
                "clpid": "Chevalier-Sandra"
            },
            {
                "family_name": "Copley",
                "given_name": "Richard R.",
                "orcid": "0000-0001-7846-4954",
                "clpid": "Copley-Richard-R"
            },
            {
                "family_name": "Houliston",
                "given_name": "Evelyn",
                "orcid": "0000-0001-9264-2585",
                "clpid": "Houliston-Evelyn"
            },
            {
                "family_name": "Anderson",
                "given_name": "David J.",
                "orcid": "0000-0001-6175-3872",
                "clpid": "Anderson-D-J"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "We present an organism-wide, transcriptomic cell atlas of the hydrozoan medusa Clytia hemisphaerica and describe how its component cell types respond to perturbation. Using multiplexed single-cell RNA sequencing, in which individual animals were indexed and pooled from control and perturbation conditions into a single sequencing run, we avoid artifacts from batch effects and are able to discern shifts in cell state in response to organismal perturbations. This work serves as a foundation for future studies of development, function, and regeneration in a genetically tractable jellyfish species. Moreover, we introduce a powerful workflow for high-resolution, whole-animal, multiplexed single-cell genomics that is readily adaptable to other traditional or nontraditional model organisms.",
        "doi": "10.1126/sciadv.abh1683",
        "pmcid": "PMC8626072",
        "issn": "2375-2548",
        "publisher": "American Association for the Advancement of Science",
        "publication": "Science Advances",
        "publication_date": "2021-11-26",
        "series_number": "48",
        "volume": "7",
        "issue": "48",
        "pages": "Art. No. eabh1683"
    },
    {
        "id": "authors:mn91e-p6r12",
        "collection": "authors",
        "collection_id": "mn91e-p6r12",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20210930-221100053",
        "type": "article",
        "title": "SWALO: scaffolding with assembly likelihood optimization",
        "author": [
            {
                "family_name": "Rahman",
                "given_name": "Atif",
                "orcid": "0000-0003-1805-3971",
                "clpid": "Rahman-Atif"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "Scaffolding, i.e. ordering and orienting contigs, is an important step in genome assembly. We present a method for scaffolding using second generation sequencing reads based on likelihoods of genome assemblies. A generative model for sequencing is used to obtain maximum likelihood estimates of gaps between contigs and to estimate whether linking contigs into scaffolds would lead to an increase in the likelihood of the assembly. We then link contigs if they can be unambiguously joined or if the corresponding increase in likelihood is substantially greater than that of other possible joins of those contigs. The method is implemented in a tool called SWALO with approximations to make it efficient and applicable to large datasets. Analysis on real and simulated datasets reveals that it consistently makes more or a similar number of correct joins compared with other scaffolders while linking very few contigs incorrectly, thus outperforming other scaffolders and demonstrating that substantial improvement in genome assembly may be achieved through the use of statistical models. SWALO is freely available for download at https://atifrahman.github.io/SWALO/.",
        "doi": "10.1093/nar/gkab717",
        "issn": "0305-1048",
        "publisher": "Oxford University Press",
        "publication": "Nucleic Acids Research",
        "publication_date": "2021-11-18",
        "series_number": "20",
        "volume": "49",
        "issue": "20",
        "pages": "Art. No. e117"
    },
    {
        "id": "authors:6zspp-xxk39",
        "collection": "authors",
        "collection_id": "6zspp-xxk39",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20201027-075126222",
        "type": "article",
        "title": "A multimodal cell census and atlas of the mammalian primary motor cortex",
        "author": [
            {
                "family_name": "Adkins",
                "given_name": "Ricky S.",
                "orcid": "0000-0002-7983-5486",
                "clpid": "Adkins-Ricky-S"
            },
            {
                "family_name": "Aldridge",
                "given_name": "Andrew I.",
                "orcid": "0000-0003-1962-8802"
            },
            {
                "family_name": "Allen",
                "given_name": "Shona",
                "orcid": "0000-0003-0186-0574"
            },
            {
                "family_name": "Ament",
                "given_name": "Seth A.",
                "orcid": "0000-0001-6443-7509"
            },
            {
                "family_name": "An",
                "given_name": "Xu",
                "orcid": "0000-0003-3386-5521"
            },
            {
                "family_name": "Armand",
                "given_name": "Ethan",
                "orcid": "0000-0002-4516-6317"
            },
            {
                "family_name": "Ascoli",
                "given_name": "Giorgio A.",
                "orcid": "0000-0002-0964-676X"
            },
            {
                "family_name": "Bakken",
                "given_name": "Trygve E.",
                "orcid": "0000-0003-3373-7386"
            },
            {
                "family_name": "Bandrowski",
                "given_name": "Anita",
                "orcid": "0000-0002-5497-0243"
            },
            {
                "family_name": "Banerjee",
                "given_name": "Samik",
                "orcid": "0000-0003-2325-1489"
            },
            {
                "family_name": "Barkas",
                "given_name": "Nikolaos",
                "orcid": "0000-0002-4675-0718"
            },
            {
                "family_name": "Bartlett",
                "given_name": "Anna",
                "orcid": "0000-0001-7059-4033"
            },
            {
                "family_name": "Bateup",
                "given_name": "Helen S.",
                "orcid": "0000-0002-0135-0972"
            },
            {
                "family_name": "Behrens",
                "given_name": "M. Margarita",
                "orcid": "0000-0002-7168-8186"
            },
            {
                "family_name": "Berens",
                "given_name": "Philipp",
                "orcid": "0000-0002-0199-4727"
            },
            {
                "family_name": "Berg",
                "given_name": "Jim",
                "orcid": "0000-0002-3300-5399"
            },
            {
                "family_name": "Bernabucci",
                "given_name": "Matteo",
                "orcid": "0000-0003-4458-117X"
            },
            {
                "family_name": "Bernaerts",
                "given_name": "Yves",
                "orcid": "0000-0003-4948-0423"
            },
            {
                "family_name": "Bertagnolli",
                "given_name": "Darren",
                "orcid": "0000-0002-6626-1567"
            },
            {
                "family_name": "Biancalani",
                "given_name": "Tommaso",
                "orcid": "0000-0001-9104-9755"
            },
            {
                "family_name": "Boggeman",
                "given_name": "Lara"
            },
            {
                "family_name": "Booeshaghi",
                "given_name": "A. Sina",
                "orcid": "0000-0002-6442-4502",
                "clpid": "Booeshaghi-A-Sina"
            },
            {
                "family_name": "Bowman",
                "given_name": "Ian",
                "orcid": "0000-0001-7366-9192"
            },
            {
                "family_name": "Bravo",
                "given_name": "H\u00e9ctor Corrada",
                "orcid": "0000-0002-1255-4444"
            },
            {
                "family_name": "Cadwell",
                "given_name": "Cathryn Ren\u00e9",
                "orcid": "0000-0003-1963-8285"
            },
            {
                "family_name": "Callaway",
                "given_name": "Edward M.",
                "orcid": "0000-0002-6366-5267"
            },
            {
                "family_name": "Carlin",
                "given_name": "Benjamin",
                "orcid": "0000-0002-9360-9143"
            },
            {
                "family_name": "O'Connor",
                "given_name": "Carolyn",
                "orcid": "0000-0002-3301-7912"
            },
            {
                "family_name": "Carter",
                "given_name": "Robert",
                "orcid": "0000-0003-0937-8141"
            },
            {
                "family_name": "Casper",
                "given_name": "Tamara",
                "orcid": "0000-0003-1638-3651"
            },
            {
                "family_name": "Castanon",
                "given_name": "Rosa G.",
                "orcid": "0000-0003-1791-002X"
            },
            {
                "family_name": "Castro",
                "given_name": "Jesus Ramon",
                "orcid": "0000-0002-6628-980X"
            },
            {
                "family_name": "Chance",
                "given_name": "Rebecca K.",
                "orcid": "0000-0001-7059-6119"
            },
            {
                "family_name": "Chatterjee",
                "given_name": "Apaala",
                "orcid": "0000-0003-1170-8971"
            },
            {
                "family_name": "Chen",
                "given_name": "Huaming",
                "orcid": "0000-0001-5289-7882"
            },
            {
                "family_name": "Chun",
                "given_name": "Jerold",
                "orcid": "0000-0003-3964-0921"
            },
            {
                "family_name": "Colantuoni",
                "given_name": "Carlo",
                "orcid": "0000-0001-6818-6380"
            },
            {
                "family_name": "Crabtree",
                "given_name": "Jonathan",
                "orcid": "0000-0002-7286-5690"
            },
            {
                "family_name": "Creasy",
                "given_name": "Heather",
                "orcid": "0000-0002-1369-6882"
            },
            {
                "family_name": "Crichton",
                "given_name": "Kirsten",
                "orcid": "0000-0002-7869-1492"
            },
            {
                "family_name": "Crow",
                "given_name": "Megan",
                "orcid": "0000-0002-1172-5897"
            },
            {
                "family_name": "D'Orazi",
                "given_name": "Florence D.",
                "orcid": "0000-0002-7354-4725"
            },
            {
                "family_name": "Daigle",
                "given_name": "Tanya L.",
                "orcid": "0000-0001-9700-8452"
            },
            {
                "family_name": "Dalley",
                "given_name": "Rachel",
                "orcid": "0000-0001-7461-7845"
            },
            {
                "family_name": "Dee",
                "given_name": "Nick",
                "orcid": "0000-0002-2831-9254"
            },
            {
                "family_name": "Degatano",
                "given_name": "Kylee",
                "orcid": "0000-0002-0945-3300"
            },
            {
                "family_name": "Dichter",
                "given_name": "Benjamin",
                "orcid": "0000-0001-5725-6910"
            },
            {
                "family_name": "Diep",
                "given_name": "Dinh",
                "orcid": "0000-0001-6057-4119"
            },
            {
                "family_name": "Ding",
                "given_name": "Liya",
                "orcid": "0000-0002-1209-875X"
            },
            {
                "family_name": "Ding",
                "given_name": "Song-Lin",
                "orcid": "0000-0002-7072-5272"
            },
            {
                "family_name": "Dominguez",
                "given_name": "Bertha",
                "orcid": "0000-0002-9470-7300"
            },
            {
                "family_name": "Dong",
                "given_name": "Hong-Wei",
                "orcid": "0000-0001-9972-3177"
            },
            {
                "family_name": "Dong",
                "given_name": "Weixiu",
                "orcid": "0000-0003-1059-5653"
            },
            {
                "family_name": "Dougherty",
                "given_name": "Elizabeth L.",
                "orcid": "0000-0001-8922-5078"
            },
            {
                "family_name": "Dudoit",
                "given_name": "Sandrine",
                "orcid": "0000-0002-6069-8629"
            },
            {
                "family_name": "Ecker",
                "given_name": "Joseph R.",
                "orcid": "0000-0001-5799-5895"
            },
            {
                "family_name": "Eichhorn",
                "given_name": "Stephen W.",
                "orcid": "0000-0002-6410-4699"
            },
            {
                "family_name": "Fang",
                "given_name": "Rongxin",
                "orcid": "0000-0003-0107-7504"
            },
            {
                "family_name": "Felix",
                "given_name": "Victor",
                "orcid": "0000-0002-9773-0629"
            },
            {
                "family_name": "Feng",
                "given_name": "Guoping",
                "orcid": "0000-0002-8021-277X"
            },
            {
                "family_name": "Feng",
                "given_name": "Zhao",
                "orcid": "0000-0001-5035-7655"
            },
            {
                "family_name": "Fischer",
                "given_name": "Stephan",
                "orcid": "0000-0002-7034-4103"
            },
            {
                "family_name": "Fitzpatrick",
                "given_name": "Conor",
                "orcid": "0000-0003-2625-6277"
            },
            {
                "family_name": "Fong",
                "given_name": "Olivia",
                "orcid": "0000-0002-7091-9667"
            },
            {
                "family_name": "Foster",
                "given_name": "Nicholas N.",
                "orcid": "0000-0003-1740-9788"
            },
            {
                "family_name": "Galbavy",
                "given_name": "William",
                "orcid": "0000-0003-0948-9538"
            },
            {
                "family_name": "Gee",
                "given_name": "James C.",
                "orcid": "0000-0002-2258-0187"
            },
            {
                "family_name": "Ghosh",
                "given_name": "Satrajit S.",
                "orcid": "0000-0002-5312-6729"
            },
            {
                "family_name": "Giglio",
                "given_name": "Michelle",
                "orcid": "0000-0001-7628-5565"
            },
            {
                "family_name": "Gillespie",
                "given_name": "Thomas H.",
                "orcid": "0000-0002-7509-4801"
            },
            {
                "family_name": "Gillis",
                "given_name": "Jesse",
                "orcid": "0000-0002-0936-9774"
            },
            {
                "family_name": "Goldman",
                "given_name": "Melissa",
                "orcid": "0000-0003-1469-5360"
            },
            {
                "family_name": "Goldy",
                "given_name": "Jeff",
                "orcid": "0000-0001-5140-6922"
            },
            {
                "family_name": "Gong",
                "given_name": "Hui",
                "orcid": "0000-0001-5519-6248"
            },
            {
                "family_name": "Gou",
                "given_name": "Lin",
                "orcid": "0000-0002-3109-1879"
            },
            {
                "family_name": "Grauer",
                "given_name": "Michael",
                "orcid": "0000-0002-4167-1076"
            },
            {
                "family_name": "Halchenko",
                "given_name": "Yaroslav O.",
                "orcid": "0000-0003-3456-2493"
            },
            {
                "family_name": "Harris",
                "given_name": "Julie A.",
                "orcid": "0000-0003-0820-2021"
            },
            {
                "family_name": "Hartmanis",
                "given_name": "Leonard",
                "orcid": "0000-0002-4922-8781"
            },
            {
                "family_name": "Hatfield",
                "given_name": "Joshua T.",
                "orcid": "0000-0002-1639-7212"
            },
            {
                "family_name": "Hawrylycz",
                "given_name": "Mike",
                "orcid": "0000-0002-5741-8024"
            },
            {
                "family_name": "Helba",
                "given_name": "Brian",
                "orcid": "0000-0003-2628-805X"
            },
            {
                "family_name": "Herb",
                "given_name": "Brian R.",
                "orcid": "0000-0002-5910-9647"
            },
            {
                "family_name": "Hertzano",
                "given_name": "Ronna",
                "orcid": "0000-0002-8093-6567"
            },
            {
                "family_name": "Hintiryan",
                "given_name": "Houri",
                "orcid": "0000-0002-9721-6785"
            },
            {
                "family_name": "Hirokawa",
                "given_name": "Karla E.",
                "orcid": "0000-0002-9954-5515"
            },
            {
                "family_name": "Hockemeyer",
                "given_name": "Dirk",
                "orcid": "0000-0002-5598-5092"
            },
            {
                "family_name": "Hodge",
                "given_name": "Rebecca D.",
                "orcid": "0000-0002-5784-9668"
            },
            {
                "family_name": "Hood",
                "given_name": "Greg",
                "orcid": "0000-0001-9871-7154"
            },
            {
                "family_name": "Horwitz",
                "given_name": "Gregory D.",
                "orcid": "0000-0001-5130-5259"
            },
            {
                "family_name": "Hou",
                "given_name": "Xiaomeng",
                "orcid": "0000-0002-5453-9015"
            },
            {
                "family_name": "Hu",
                "given_name": "Lijuan",
                "orcid": "0000-0003-1869-0372"
            },
            {
                "family_name": "Hu",
                "given_name": "Qiwen",
                "orcid": "0000-0003-2798-919X"
            },
            {
                "family_name": "Huang",
                "given_name": "Z. Josh",
                "orcid": "0000-0003-0592-028X"
            },
            {
                "family_name": "Huo",
                "given_name": "Bingxing",
                "orcid": "0000-0002-9389-2591"
            },
            {
                "family_name": "Ito-Cole",
                "given_name": "Tony",
                "orcid": "0000-0001-5898-3108"
            },
            {
                "family_name": "Jacobs",
                "given_name": "Matthew",
                "orcid": "0000-0002-3004-8553"
            },
            {
                "family_name": "Jia",
                "given_name": "Xueyan",
                "orcid": "0000-0002-1221-6357"
            },
            {
                "family_name": "Jiang",
                "given_name": "Shengdian",
                "orcid": "0000-0002-2277-263X"
            },
            {
                "family_name": "Jiang",
                "given_name": "Tao",
                "orcid": "0000-0002-4487-299X"
            },
            {
                "family_name": "Jiang",
                "given_name": "Xiaolong",
                "orcid": "0000-0001-8066-1383"
            },
            {
                "family_name": "Jin",
                "given_name": "Xin",
                "orcid": "0000-0002-1106-4013"
            },
            {
                "family_name": "Jorstad",
                "given_name": "Nikolas L.",
                "orcid": "0000-0001-7906-9470"
            },
            {
                "family_name": "Kalmbach",
                "given_name": "Brian E.",
                "orcid": "0000-0003-3136-8097"
            },
            {
                "family_name": "Kancherla",
                "given_name": "Jayaram",
                "orcid": "0000-0001-5855-5031"
            },
            {
                "family_name": "Keene",
                "given_name": "C. Dirk",
                "orcid": "0000-0002-5291-1469"
            },
            {
                "family_name": "Kelly",
                "given_name": "Kathleen",
                "orcid": "0000-0003-2334-9785"
            },
            {
                "family_name": "Khajouei",
                "given_name": "Farzaneh",
                "orcid": "0000-0002-0148-9122"
            },
            {
                "family_name": "Kharchenko",
                "given_name": "Peter V.",
                "orcid": "0000-0002-6036-5875"
            },
            {
                "family_name": "Kim",
                "given_name": "Gukhan",
                "orcid": "0000-0002-3338-5045"
            },
            {
                "family_name": "Ko",
                "given_name": "Andrew L.",
                "orcid": "0000-0002-6253-9891"
            },
            {
                "family_name": "Kobak",
                "given_name": "Dmitry",
                "orcid": "0000-0002-5639-7209"
            },
            {
                "family_name": "Konwar",
                "given_name": "Kishori",
                "orcid": "0000-0001-5152-4777"
            },
            {
                "family_name": "Kramer",
                "given_name": "Daniel J.",
                "orcid": "0000-0003-4241-3586"
            },
            {
                "family_name": "Krienen",
                "given_name": "Fenna M.",
                "orcid": "0000-0002-1400-6820"
            },
            {
                "family_name": "Kroll",
                "given_name": "Matthew",
                "orcid": "0000-0002-0126-7618"
            },
            {
                "family_name": "Kuang",
                "given_name": "Xiuli",
                "orcid": "0000-0001-7569-7605"
            },
            {
                "family_name": "Kuo",
                "given_name": "Hsien-Chi",
                "orcid": "0000-0002-0215-2302"
            },
            {
                "family_name": "Lake",
                "given_name": "Blue B.",
                "orcid": "0000-0002-8637-9044"
            },
            {
                "family_name": "Larsen",
                "given_name": "Rachael",
                "orcid": "0000-0003-0178-003X"
            },
            {
                "family_name": "Lathia",
                "given_name": "Kanan",
                "orcid": "0000-0003-0080-1951"
            },
            {
                "family_name": "Laturnus",
                "given_name": "Sophie",
                "orcid": "0000-0001-9532-788X"
            },
            {
                "family_name": "Lee",
                "given_name": "Angus Y.",
                "orcid": "0000-0002-7649-2705"
            },
            {
                "family_name": "Lee",
                "given_name": "Cheng-Ta",
                "orcid": "0000-0001-6183-2319"
            },
            {
                "family_name": "Lee",
                "given_name": "Kuo-Fen",
                "orcid": "0000-0003-2224-2708"
            },
            {
                "family_name": "Lein",
                "given_name": "Ed S.",
                "orcid": "0000-0001-9012-6552"
            },
            {
                "family_name": "Lesnar",
                "given_name": "Phil",
                "orcid": "0000-0002-2152-604X"
            },
            {
                "family_name": "Li",
                "given_name": "Anan",
                "orcid": "0000-0002-5877-4813"
            },
            {
                "family_name": "Li",
                "given_name": "Xiangning",
                "orcid": "0000-0002-3747-2824"
            },
            {
                "family_name": "Li",
                "given_name": "Xu"
            },
            {
                "family_name": "Li",
                "given_name": "Yang Eric",
                "orcid": "0000-0001-6997-6018"
            },
            {
                "family_name": "Li",
                "given_name": "Yaoyao",
                "orcid": "0000-0001-5468-9876"
            },
            {
                "family_name": "Li",
                "given_name": "Yuanyuan",
                "orcid": "0000-0002-0897-5270"
            },
            {
                "family_name": "Lim",
                "given_name": "Byungkook",
                "orcid": "0000-0002-3766-5415"
            },
            {
                "family_name": "Linnarsson",
                "given_name": "Sten",
                "orcid": "0000-0002-3491-3444"
            },
            {
                "family_name": "Liu",
                "given_name": "Christine S.",
                "orcid": "0000-0002-1239-4612"
            },
            {
                "family_name": "Liu",
                "given_name": "Hanqing",
                "orcid": "0000-0002-5114-6048"
            },
            {
                "family_name": "Liu",
                "given_name": "Lijuan",
                "orcid": "0000-0002-9548-6183"
            },
            {
                "family_name": "Lucero",
                "given_name": "Jacinta D.",
                "orcid": "0000-0001-7578-6624"
            },
            {
                "family_name": "Luo",
                "given_name": "Chongyuan",
                "orcid": "0000-0002-8541-0695"
            },
            {
                "family_name": "Luo",
                "given_name": "Qingming",
                "orcid": "0000-0002-6725-9311"
            },
            {
                "family_name": "Macosko",
                "given_name": "Evan Z.",
                "orcid": "0000-0002-2794-5165"
            },
            {
                "family_name": "Mahurkar",
                "given_name": "Anup",
                "orcid": "0000-0002-4999-2296"
            },
            {
                "family_name": "Martone",
                "given_name": "Maryann E.",
                "orcid": "0000-0002-8406-3871"
            },
            {
                "family_name": "Matho",
                "given_name": "Katherine S.",
                "orcid": "0000-0002-6105-4219"
            },
            {
                "family_name": "McCarroll",
                "given_name": "Steven A.",
                "orcid": "0000-0002-6954-8184"
            },
            {
                "family_name": "McCracken",
                "given_name": "Carrie",
                "orcid": "0000-0002-8038-9727"
            },
            {
                "family_name": "McMillen",
                "given_name": "Delissa",
                "orcid": "0000-0002-3413-4424"
            },
            {
                "family_name": "Miranda",
                "given_name": "Elanine",
                "orcid": "0000-0002-1633-9303"
            },
            {
                "family_name": "Mitra",
                "given_name": "Partha P.",
                "orcid": "0000-0001-8818-6804"
            },
            {
                "family_name": "Miyazaki",
                "given_name": "Paula Assakura",
                "orcid": "0000-0003-1295-8710"
            },
            {
                "family_name": "Mizrachi",
                "given_name": "Judith",
                "orcid": "0000-0003-2195-8210"
            },
            {
                "family_name": "Mok",
                "given_name": "Stephanie",
                "orcid": "0000-0002-2688-1569"
            },
            {
                "family_name": "Mukamel",
                "given_name": "Eran A.",
                "orcid": "0000-0003-3203-9535"
            },
            {
                "family_name": "Mulherkar",
                "given_name": "Shalaka",
                "orcid": "0000-0001-8736-527X"
            },
            {
                "family_name": "Nadaf",
                "given_name": "Naeem M.",
                "orcid": "0000-0002-7805-8523"
            },
            {
                "family_name": "Naeemi",
                "given_name": "Maitham",
                "orcid": "0000-0001-9139-3548"
            },
            {
                "family_name": "Narasimhan",
                "given_name": "Arun",
                "orcid": "0000-0002-0246-6301"
            },
            {
                "family_name": "Nery",
                "given_name": "Joseph R.",
                "orcid": "0000-0003-0153-5659"
            },
            {
                "family_name": "Ng",
                "given_name": "Lydia",
                "orcid": "0000-0002-7499-3514"
            },
            {
                "family_name": "Ngai",
                "given_name": "John",
                "orcid": "0000-0002-1191-8971"
            },
            {
                "family_name": "Nguyen",
                "given_name": "Thuc Nghi",
                "orcid": "0000-0002-6466-5883"
            },
            {
                "family_name": "Nickel",
                "given_name": "Lance",
                "orcid": "0000-0002-5836-3571"
            },
            {
                "family_name": "Nicovich",
                "given_name": "Philip R.",
                "orcid": "0000-0002-8517-4469"
            },
            {
                "family_name": "Niu",
                "given_name": "Sheng-Yong",
                "orcid": "0000-0002-7734-1191"
            },
            {
                "family_name": "Ntranos",
                "given_name": "Vasilis",
                "orcid": "0000-0002-2477-0670"
            },
            {
                "family_name": "Nunn",
                "given_name": "Michael",
                "orcid": "0000-0002-6771-9912"
            },
            {
                "family_name": "Olley",
                "given_name": "Dustin",
                "orcid": "0000-0001-8685-0839"
            },
            {
                "family_name": "Orvis",
                "given_name": "Joshua",
                "orcid": "0000-0002-5705-5710"
            },
            {
                "family_name": "Osteen",
                "given_name": "Julia K.",
                "orcid": "0000-0001-7058-3297"
            },
            {
                "family_name": "Osten",
                "given_name": "Pavel",
                "orcid": "0000-0002-6385-7541"
            },
            {
                "family_name": "Owen",
                "given_name": "Scott F.",
                "orcid": "0000-0001-6294-7513"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "family_name": "Palaniswamy",
                "given_name": "Ramesh",
                "orcid": "0000-0003-4322-2407"
            },
            {
                "family_name": "Palmer",
                "given_name": "Carter R.",
                "orcid": "0000-0002-2385-2068"
            },
            {
                "family_name": "Pang",
                "given_name": "Yan",
                "orcid": "0000-0003-3323-5052"
            },
            {
                "family_name": "Peng",
                "given_name": "Hanchuan",
                "orcid": "0000-0002-3478-3942"
            },
            {
                "family_name": "Pham",
                "given_name": "Thanh",
                "orcid": "0000-0002-4738-5062"
            },
            {
                "family_name": "Pinto-Duarte",
                "given_name": "Antonio",
                "orcid": "0000-0002-2215-7653"
            },
            {
                "family_name": "Plongthongkum",
                "given_name": "Nongluk",
                "orcid": "0000-0002-1305-285X"
            },
            {
                "family_name": "Poirion",
                "given_name": "Olivier",
                "orcid": "0000-0002-0429-7003"
            },
            {
                "family_name": "Preissl",
                "given_name": "Sebastian",
                "orcid": "0000-0001-8971-5616"
            },
            {
                "family_name": "Purdom",
                "given_name": "Elizabeth",
                "orcid": "0000-0001-9455-7990"
            },
            {
                "family_name": "Qu",
                "given_name": "Lei",
                "orcid": "0000-0002-2129-5253"
            },
            {
                "family_name": "Rashid",
                "given_name": "Mohammad",
                "orcid": "0000-0002-7884-4954"
            },
            {
                "family_name": "Reed",
                "given_name": "Nora M.",
                "orcid": "0000-0003-0408-1568"
            },
            {
                "family_name": "Regev",
                "given_name": "Aviv",
                "orcid": "0000-0003-3293-3158"
            },
            {
                "family_name": "Ren",
                "given_name": "Bing",
                "orcid": "0000-0002-5435-1127"
            },
            {
                "family_name": "Ren",
                "given_name": "Miao",
                "orcid": "0000-0002-5555-5279"
            },
            {
                "family_name": "Rimorin",
                "given_name": "Christine",
                "orcid": "0000-0003-1491-8552"
            },
            {
                "family_name": "Risso",
                "given_name": "Davide",
                "orcid": "0000-0001-8508-5012"
            },
            {
                "family_name": "Rivkin",
                "given_name": "Angeline C.",
                "orcid": "0000-0003-0399-9043"
            },
            {
                "family_name": "Mu\u00f1oz-Casta\u00f1eda",
                "given_name": "Rodrigo",
                "orcid": "0000-0002-1176-7421"
            },
            {
                "family_name": "Romanow",
                "given_name": "William J.",
                "orcid": "0000-0002-3808-6482"
            },
            {
                "family_name": "Ropelewski",
                "given_name": "Alexander J.",
                "orcid": "0000-0001-6874-4477"
            },
            {
                "family_name": "Roux de B\u00e9zieux",
                "given_name": "Hector",
                "orcid": "0000-0002-1489-8339"
            },
            {
                "family_name": "Ruan",
                "given_name": "Zongcai",
                "orcid": "0000-0003-1547-165X"
            },
            {
                "family_name": "Sandberg",
                "given_name": "Rickard",
                "orcid": "0000-0001-6473-1740"
            },
            {
                "family_name": "Savoia",
                "given_name": "Steven",
                "orcid": "0000-0003-4514-7367"
            },
            {
                "family_name": "Scala",
                "given_name": "Federico",
                "orcid": "0000-0002-2680-8572"
            },
            {
                "family_name": "Schor",
                "given_name": "Michael",
                "orcid": "0000-0002-4493-7992"
            },
            {
                "family_name": "Shen",
                "given_name": "Elise",
                "orcid": "0000-0002-3295-3928"
            },
            {
                "family_name": "Siletti",
                "given_name": "Kimberly",
                "orcid": "0000-0001-7620-8973"
            },
            {
                "family_name": "Smith",
                "given_name": "Jared B.",
                "orcid": "0000-0002-0273-4898"
            },
            {
                "family_name": "Smith",
                "given_name": "Kimberly",
                "orcid": "0000-0002-3142-1970"
            },
            {
                "family_name": "Somasundaram",
                "given_name": "Saroja",
                "orcid": "0000-0002-3729-9849"
            },
            {
                "family_name": "Song",
                "given_name": "Yuanyuan",
                "orcid": "0000-0002-9183-5884"
            },
            {
                "family_name": "Sorensen",
                "given_name": "Staci A.",
                "orcid": "0000-0002-6799-2126"
            },
            {
                "family_name": "Stafford",
                "given_name": "David A.",
                "orcid": "0000-0002-3310-5402"
            },
            {
                "family_name": "Street",
                "given_name": "Kelly",
                "orcid": "0000-0001-6379-5013"
            },
            {
                "family_name": "Sulc",
                "given_name": "Josef",
                "orcid": "0000-0002-4928-7183"
            },
            {
                "family_name": "Sunkin",
                "given_name": "Susan",
                "orcid": "0000-0001-9893-3834"
            },
            {
                "family_name": "Svensson",
                "given_name": "Valentine",
                "orcid": "0000-0002-9217-2330",
                "clpid": "Svensson-Valentine"
            },
            {
                "family_name": "Tan",
                "given_name": "Pengcheng",
                "orcid": "0000-0001-7276-0381"
            },
            {
                "family_name": "Tan",
                "given_name": "Zheng Huan",
                "orcid": "0000-0002-1886-2421"
            },
            {
                "family_name": "Tasic",
                "given_name": "Bosiljka",
                "orcid": "0000-0002-6861-4506"
            },
            {
                "family_name": "Thompson",
                "given_name": "Carol",
                "orcid": "0000-0003-1528-3237"
            },
            {
                "family_name": "Tian",
                "given_name": "Wei",
                "orcid": "0000-0002-2146-1717"
            },
            {
                "family_name": "Tickle",
                "given_name": "Timothy L.",
                "orcid": "0000-0002-6592-6272"
            },
            {
                "family_name": "Tieu",
                "given_name": "Michael",
                "orcid": "0000-0001-9286-5623"
            },
            {
                "family_name": "Ting",
                "given_name": "Jonathan T.",
                "orcid": "0000-0001-8266-0392"
            },
            {
                "family_name": "Tolias",
                "given_name": "Andreas Savas",
                "orcid": "0000-0002-4305-6376"
            },
            {
                "family_name": "Torkelson",
                "given_name": "Amy",
                "orcid": "0000-0002-9465-4202"
            },
            {
                "family_name": "Tung",
                "given_name": "Herman",
                "orcid": "0000-0002-0812-3318"
            },
            {
                "family_name": "Vaishnav",
                "given_name": "Eeshit Dhaval",
                "orcid": "0000-0003-3720-8051"
            },
            {
                "family_name": "Van den Berge",
                "given_name": "Koen",
                "orcid": "0000-0002-1833-8478"
            },
            {
                "family_name": "van Velthoven",
                "given_name": "Cindy T.J.",
                "orcid": "0000-0001-5120-4546"
            },
            {
                "family_name": "Vanderburg",
                "given_name": "Charles R.",
                "orcid": "0000-0001-8979-5054"
            },
            {
                "family_name": "Veldman",
                "given_name": "Matthew B.",
                "orcid": "0000-0002-0328-5916"
            },
            {
                "family_name": "Vu",
                "given_name": "Minh",
                "orcid": "0000-0003-4154-5659"
            },
            {
                "family_name": "Wakeman",
                "given_name": "Wayne",
                "orcid": "0000-0002-3693-3609"
            },
            {
                "family_name": "Wang",
                "given_name": "Peng",
                "orcid": "0000-0003-1181-5558"
            },
            {
                "family_name": "Wang",
                "given_name": "Quanxin",
                "orcid": "0000-0002-0007-7935"
            },
            {
                "family_name": "Wang",
                "given_name": "Xinxin",
                "orcid": "0000-0001-6393-2276"
            },
            {
                "family_name": "Wang",
                "given_name": "Yimin",
                "orcid": "0000-0003-2515-6602"
            },
            {
                "family_name": "Wang",
                "given_name": "Yun",
                "orcid": "0000-0001-5501-8433"
            },
            {
                "family_name": "Welch",
                "given_name": "Joshua D.",
                "orcid": "0000-0002-5869-2391"
            },
            {
                "family_name": "White",
                "given_name": "Owen",
                "orcid": "0000-0003-2407-7320"
            },
            {
                "family_name": "Williams",
                "given_name": "Elora",
                "orcid": "0000-0002-0178-5511"
            },
            {
                "family_name": "Xie",
                "given_name": "Fangming",
                "orcid": "0000-0001-5232-1648"
            },
            {
                "family_name": "Xie",
                "given_name": "Peng",
                "orcid": "0000-0002-9509-7268"
            },
            {
                "family_name": "Xiong",
                "given_name": "Feng",
                "orcid": "0000-0002-6927-8903"
            },
            {
                "family_name": "Yang",
                "given_name": "X. William",
                "orcid": "0000-0003-3705-7935"
            },
            {
                "family_name": "Yanny",
                "given_name": "Anna Marie",
                "orcid": "0000-0001-7250-8450"
            },
            {
                "family_name": "Yao",
                "given_name": "Zizhen",
                "orcid": "0000-0002-9361-5607"
            },
            {
                "family_name": "Yin",
                "given_name": "Lulu",
                "orcid": "0000-0003-2932-6349"
            },
            {
                "family_name": "Yu",
                "given_name": "Yang",
                "orcid": "0000-0002-4340-430X"
            },
            {
                "family_name": "Yuan",
                "given_name": "Jing",
                "orcid": "0000-0001-9050-4496"
            },
            {
                "family_name": "Zeng",
                "given_name": "Hongkui",
                "orcid": "0000-0002-0326-5878"
            },
            {
                "family_name": "Zhang",
                "given_name": "Kun",
                "orcid": "0000-0002-7596-5224"
            },
            {
                "family_name": "Zhang",
                "given_name": "Meng",
                "orcid": "0000-0002-9753-0635"
            },
            {
                "family_name": "Zhang",
                "given_name": "Zhuzhu",
                "orcid": "0000-0002-2661-4700"
            },
            {
                "family_name": "Zhao",
                "given_name": "Sujun",
                "orcid": "0000-0001-7807-7495"
            },
            {
                "family_name": "Zhao",
                "given_name": "Xuan",
                "orcid": "0000-0002-5778-5422"
            },
            {
                "family_name": "Zhou",
                "given_name": "Jingtian",
                "orcid": "0000-0003-2060-1922"
            },
            {
                "family_name": "Zhuang",
                "given_name": "Xiaowei",
                "orcid": "0000-0002-6034-7853"
            },
            {
                "family_name": "Zingg",
                "given_name": "Brian",
                "orcid": "0000-0001-8657-8863"
            },
            {
                "literal": "BRAIN Initiative Cell Census Network (BICCN)"
            }
        ],
        "abstract": "Here we report the generation of a multimodal cell census and atlas of the mammalian primary motor cortex as the initial product of the BRAIN Initiative Cell Census Network (BICCN). This was achieved by coordinated large-scale analyses of single-cell transcriptomes, chromatin accessibility, DNA methylomes, spatially resolved single-cell transcriptomes, morphological and electrophysiological properties and cellular resolution input\u2013output mapping, integrated through cross-modal computational analysis. Our results advance the collective knowledge and understanding of brain cell-type organization. First, our study reveals a unified molecular genetic landscape of cortical cell types that integrates their transcriptome, open chromatin and DNA methylation maps. Second, cross-species analysis achieves a consensus taxonomy of transcriptomic types and their hierarchical organization that is conserved from mouse to marmoset and human. Third, in situ single-cell transcriptomics provides a spatially resolved cell-type atlas of the motor cortex. Fourth, cross-modal analysis provides compelling evidence for the transcriptomic, epigenomic and gene regulatory basis of neuronal phenotypes such as their physiological and anatomical properties, demonstrating the biological validity and genomic underpinning of neuron types. We further present an extensive genetic toolset for targeting glutamatergic neuron types towards linking their molecular and developmental identity to their circuit function. Together, our results establish a unifying and mechanistic framework of neuronal cell-type organization that integrates multi-layered molecular genetic and spatial information with multi-faceted phenotypic properties.",
        "doi": "10.1038/s41586-021-03950-0",
        "issn": "0028-0836",
        "publisher": "Nature Publishing Group",
        "publication": "Nature",
        "publication_date": "2021-10-07",
        "series_number": "7879",
        "volume": "598",
        "issue": "7879",
        "pages": "86-102"
    },
    {
        "id": "authors:13dbk-bh430",
        "collection": "authors",
        "collection_id": "13dbk-bh430",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20200303-153620082",
        "type": "article",
        "title": "A transcriptomic and epigenomic cell atlas of the mouse primary motor cortex",
        "author": [
            {
                "family_name": "Yao",
                "given_name": "Zizhen",
                "orcid": "0000-0002-9361-5607",
                "clpid": "Yao-Zizhen"
            },
            {
                "family_name": "Liu",
                "given_name": "Hanqing",
                "orcid": "0000-0002-5114-6048"
            },
            {
                "family_name": "Xie",
                "given_name": "Fangming",
                "orcid": "0000-0001-5232-1648"
            },
            {
                "family_name": "Fischer",
                "given_name": "Stephan",
                "orcid": "0000-0002-7034-4103"
            },
            {
                "family_name": "Adkins",
                "given_name": "Ricky S.",
                "orcid": "0000-0002-7983-5486"
            },
            {
                "family_name": "Aldrige",
                "given_name": "Andrew I.",
                "orcid": "0000-0003-1962-8802",
                "clpid": "Aldrige-Andrew-I"
            },
            {
                "family_name": "Ament",
                "given_name": "Seth A.",
                "orcid": "0000-0001-6443-7509"
            },
            {
                "family_name": "Bartlett",
                "given_name": "Anna",
                "orcid": "0000-0001-7059-4033"
            },
            {
                "family_name": "Behrens",
                "given_name": "M. Margarita",
                "orcid": "0000-0002-7168-8186"
            },
            {
                "family_name": "Van den Berge",
                "given_name": "Koen",
                "orcid": "0000-0002-1833-8478"
            },
            {
                "family_name": "Bertagnolli",
                "given_name": "Darren",
                "orcid": "0000-0002-6626-1567"
            },
            {
                "family_name": "de Bezieux",
                "given_name": "Hector Roux",
                "orcid": "0000-0002-1489-8339"
            },
            {
                "family_name": "Biancalani",
                "given_name": "Tommaso",
                "orcid": "0000-0001-9104-9755"
            },
            {
                "family_name": "Booeshaghi",
                "given_name": "A. Sina",
                "orcid": "0000-0002-6442-4502",
                "clpid": "Booeshaghi-A-Sina"
            },
            {
                "family_name": "Corrada Bravo",
                "given_name": "Hector",
                "orcid": "0000-0002-1255-4444"
            },
            {
                "family_name": "Casper",
                "given_name": "Tamara",
                "orcid": "0000-0003-1638-3651"
            },
            {
                "family_name": "Colantuoni",
                "given_name": "Carlo",
                "orcid": "0000-0001-6818-6380"
            },
            {
                "family_name": "Crabtree",
                "given_name": "Jonathan",
                "orcid": "0000-0002-7286-5690"
            },
            {
                "family_name": "Creasy",
                "given_name": "Heather",
                "orcid": "0000-0002-1369-6882"
            },
            {
                "family_name": "Crichton",
                "given_name": "Kirsten",
                "orcid": "0000-0002-7869-1492"
            },
            {
                "family_name": "Crow",
                "given_name": "Megan",
                "orcid": "0000-0002-1172-5897"
            },
            {
                "family_name": "Dee",
                "given_name": "Nick",
                "orcid": "0000-0002-2831-9254"
            },
            {
                "family_name": "Dougherty",
                "given_name": "Elizabeth L.",
                "orcid": "0000-0001-8922-5078"
            },
            {
                "family_name": "Doyle",
                "given_name": "Wayne I.",
                "orcid": "0000-0001-8276-2591"
            },
            {
                "family_name": "Dudoit",
                "given_name": "Sandrine",
                "orcid": "0000-0002-6069-8629"
            },
            {
                "family_name": "Fang",
                "given_name": "Rongxin",
                "orcid": "0000-0003-0107-7504"
            },
            {
                "family_name": "Felix",
                "given_name": "Victor",
                "orcid": "0000-0002-9773-0629"
            },
            {
                "family_name": "Fong",
                "given_name": "Olivia",
                "orcid": "0000-0002-7091-9667"
            },
            {
                "family_name": "Giglio",
                "given_name": "Michelle",
                "orcid": "0000-0001-7628-5565"
            },
            {
                "family_name": "Goldy",
                "given_name": "Jeff",
                "orcid": "0000-0001-5140-6922"
            },
            {
                "family_name": "Hawrylycz",
                "given_name": "Michael",
                "orcid": "0000-0002-5741-8024"
            },
            {
                "family_name": "Herb",
                "given_name": "Brian R.",
                "orcid": "0000-0002-5910-9647"
            },
            {
                "family_name": "Hertzano",
                "given_name": "Ronna",
                "orcid": "0000-0002-8093-6567"
            },
            {
                "family_name": "Hou",
                "given_name": "Xiaomeng",
                "orcid": "0000-0002-5453-9015"
            },
            {
                "family_name": "Hu",
                "given_name": "Qiwen",
                "orcid": "0000-0003-2798-919X"
            },
            {
                "family_name": "Kancherla",
                "given_name": "Jayaram",
                "orcid": "0000-0001-5855-5031"
            },
            {
                "family_name": "Kroll",
                "given_name": "Matthew",
                "orcid": "0000-0002-0126-7618"
            },
            {
                "family_name": "Lathia",
                "given_name": "Kanan",
                "orcid": "0000-0003-0080-1951"
            },
            {
                "family_name": "Li",
                "given_name": "Yang Eric",
                "orcid": "0000-0001-6997-6018"
            },
            {
                "family_name": "Lucero",
                "given_name": "Jacinta D.",
                "orcid": "0000-0001-7578-6624"
            },
            {
                "family_name": "Luo",
                "given_name": "Chongyuan",
                "orcid": "0000-0002-8541-0695"
            },
            {
                "family_name": "Mahurkar",
                "given_name": "Anup",
                "orcid": "0000-0002-4999-2296"
            },
            {
                "family_name": "McMillen",
                "given_name": "Delissa",
                "orcid": "0000-0002-3413-4424"
            },
            {
                "family_name": "Nadaf",
                "given_name": "Naeem M.",
                "orcid": "0000-0002-7805-8523"
            },
            {
                "family_name": "Nery",
                "given_name": "Joseph R.",
                "orcid": "0000-0003-0153-5659"
            },
            {
                "family_name": "Nguyen",
                "given_name": "Thuc Nghi",
                "orcid": "0000-0002-6466-5883"
            },
            {
                "family_name": "Niu",
                "given_name": "Sheng-Yong",
                "orcid": "0000-0002-7734-1191"
            },
            {
                "family_name": "Ntranos",
                "given_name": "Vasilis",
                "orcid": "0000-0002-2477-0670"
            },
            {
                "family_name": "Orvis",
                "given_name": "Joshua",
                "orcid": "0000-0002-5705-5710"
            },
            {
                "family_name": "Osteen",
                "given_name": "Julia K.",
                "orcid": "0000-0001-7058-3297"
            },
            {
                "family_name": "Pham",
                "given_name": "Thanh",
                "orcid": "0000-0002-4738-5062"
            },
            {
                "family_name": "Pinto-Duarte",
                "given_name": "Antonio",
                "orcid": "0000-0002-2215-7653"
            },
            {
                "family_name": "Poirion",
                "given_name": "Olivier",
                "orcid": "0000-0002-0429-7003"
            },
            {
                "family_name": "Preissl",
                "given_name": "Sebastian",
                "orcid": "0000-0001-8971-5616"
            },
            {
                "family_name": "Purdom",
                "given_name": "Elizabeth",
                "orcid": "0000-0001-9455-7990"
            },
            {
                "family_name": "Rimorin",
                "given_name": "Christine",
                "orcid": "0000-0003-1491-8552"
            },
            {
                "family_name": "Risso",
                "given_name": "Davide",
                "orcid": "0000-0001-8508-5012"
            },
            {
                "family_name": "Rivkin",
                "given_name": "Angeline C.",
                "orcid": "0000-0003-0399-9043",
                "clpid": "Rivkin-Angeline-C"
            },
            {
                "family_name": "Smith",
                "given_name": "Kimberly",
                "orcid": "0000-0002-3142-1970"
            },
            {
                "family_name": "Street",
                "given_name": "Kelly",
                "orcid": "0000-0001-6379-5013"
            },
            {
                "family_name": "Sulc",
                "given_name": "Josef",
                "orcid": "0000-0002-4928-7183"
            },
            {
                "family_name": "Svensson",
                "given_name": "Valentine",
                "orcid": "0000-0002-9217-2330",
                "clpid": "Svensson-Valentine"
            },
            {
                "family_name": "Tieu",
                "given_name": "Michael",
                "orcid": "0000-0001-9286-5623"
            },
            {
                "family_name": "Torkelson",
                "given_name": "Amy",
                "orcid": "0000-0002-9465-4202"
            },
            {
                "family_name": "Tung",
                "given_name": "Herman",
                "orcid": "0000-0002-0812-3318"
            },
            {
                "family_name": "Vaishnav",
                "given_name": "Eeshit Dhaval",
                "orcid": "0000-0003-3720-8051"
            },
            {
                "family_name": "Vanderburg",
                "given_name": "Charles R.",
                "orcid": "0000-0001-8979-5054"
            },
            {
                "family_name": "van Velthoven",
                "given_name": "Cindy",
                "orcid": "0000-0001-5120-4546"
            },
            {
                "family_name": "Wang",
                "given_name": "Xinxin",
                "orcid": "0000-0001-6393-2276"
            },
            {
                "family_name": "White",
                "given_name": "Owen R.",
                "orcid": "0000-0003-2407-7320"
            },
            {
                "family_name": "Huang",
                "given_name": "Z. Josh",
                "orcid": "0000-0003-0592-028X"
            },
            {
                "family_name": "Kharchenko",
                "given_name": "Peter V.",
                "orcid": "0000-0002-6036-5875"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "family_name": "Ngai",
                "given_name": "John",
                "orcid": "0000-0002-1191-8971"
            },
            {
                "family_name": "Regev",
                "given_name": "Aviv",
                "orcid": "0000-0003-3293-3158"
            },
            {
                "family_name": "Tasic",
                "given_name": "Bosiljka",
                "orcid": "0000-0002-6861-4506"
            },
            {
                "family_name": "Welch",
                "given_name": "Joshua D.",
                "orcid": "0000-0002-5869-2391"
            },
            {
                "family_name": "Gillis",
                "given_name": "Jesse",
                "orcid": "0000-0002-0936-9774"
            },
            {
                "family_name": "Macosko",
                "given_name": "Evan Z.",
                "orcid": "0000-0002-2794-5165"
            },
            {
                "family_name": "Ren",
                "given_name": "Bing",
                "orcid": "0000-0002-5435-1127"
            },
            {
                "family_name": "Ecker",
                "given_name": "Joseph R.",
                "orcid": "0000-0001-5799-5895",
                "clpid": "Ecker-Joseph-R"
            },
            {
                "family_name": "Zeng",
                "given_name": "Hongkui",
                "orcid": "0000-0002-0326-5878"
            },
            {
                "family_name": "Mukamel",
                "given_name": "Eran A.",
                "orcid": "0000-0003-3203-9535"
            },
            {
                "literal": "BRAIN Initiative Cell Census Network (BICCN)"
            }
        ],
        "abstract": "Single-cell transcriptomics can provide quantitative molecular signatures for large, unbiased samples of the diverse cell types in the brain. With the proliferation of multi-omics datasets, a major challenge is to validate and integrate results into a biological understanding of cell-type organization. Here we generated transcriptomes and epigenomes from more than 500,000 individual cells in the mouse primary motor cortex, a structure that has an evolutionarily conserved role in locomotion. We developed computational and statistical methods to integrate multimodal data and quantitatively validate cell-type reproducibility. The resulting reference atlas\u2014containing over 56 neuronal cell types that are highly replicable across analysis methods, sequencing technologies and modalities\u2014is a comprehensive molecular and genomic account of the diverse neuronal and non-neuronal cell types in the mouse primary motor cortex. The atlas includes a population of excitatory neurons that resemble pyramidal cells in layer 4 in other cortical regions. We further discovered thousands of concordant marker genes and gene regulatory elements for these cell types. Our results highlight the complex molecular regulation of cell types in the brain and will directly enable the design of reagents to target specific cell types in the mouse primary motor cortex for functional analysis.",
        "doi": "10.1038/s41586-021-03500-8",
        "pmcid": "PMC8494649",
        "issn": "0028-0836",
        "publisher": "Nature Publishing Group",
        "publication": "Nature",
        "publication_date": "2021-10-07",
        "series_number": "7879",
        "volume": "598",
        "issue": "7879",
        "pages": "103-110"
    },
    {
        "id": "authors:mb2qc-b8r09",
        "collection": "authors",
        "collection_id": "mb2qc-b8r09",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20200306-130944112",
        "type": "article",
        "title": "Isoform cell-type specificity in the mouse primary motor cortex",
        "author": [
            {
                "family_name": "Booeshaghi",
                "given_name": "A. Sina",
                "orcid": "0000-0002-6442-4502",
                "clpid": "Booeshaghi-A-Sina"
            },
            {
                "family_name": "Yao",
                "given_name": "Zizhen",
                "orcid": "0000-0002-9361-5607",
                "clpid": "Yao-Zizhen"
            },
            {
                "family_name": "van Velthoven",
                "given_name": "Cindy",
                "orcid": "0000-0001-5120-4546",
                "clpid": "van-Velthoven-Cindy"
            },
            {
                "family_name": "Smith",
                "given_name": "Kimberly",
                "orcid": "0000-0002-3142-1970",
                "clpid": "Smith-Kimberly"
            },
            {
                "family_name": "Tasic",
                "given_name": "Bosiljka",
                "orcid": "0000-0002-6861-4506",
                "clpid": "Tasic-Bosiljka"
            },
            {
                "family_name": "Zeng",
                "given_name": "Hongkui",
                "orcid": "0000-0002-0326-5878",
                "clpid": "Zeng-Hongkui"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "Full-length SMART-seq single-cell RNA sequencing can be used to measure gene expression at isoform resolution, making possible the identification of specific isoform markers for different cell types. Used in conjunction with spatial RNA capture and gene-tagging methods, this enables the inference of spatially resolved isoform expression for different cell types. Here, in a comprehensive analysis of 6,160 mouse primary motor cortex cells assayed with SMART-seq, 280,327 cells assayed with MERFISH and 94,162 cells assayed with 10x Genomics sequencing3, we find examples of isoform specificity in cell types\u2014including isoform shifts between cell types that are masked in gene-level analysis\u2014as well as examples of transcriptional regulation. Additionally, we show that isoform specificity helps to refine cell types, and that a multi-platform analysis of single-cell transcriptomic data leveraging multiple measurements provides a comprehensive atlas of transcription in the mouse primary motor cortex that improves on the possibilities offered by any single technology.",
        "doi": "10.1038/s41586-021-03969-3",
        "pmcid": "PMC8494650",
        "issn": "0028-0836",
        "publisher": "Nature Publishing Group",
        "publication": "Nature",
        "publication_date": "2021-10-07",
        "series_number": "7879",
        "volume": "598",
        "issue": "7879",
        "pages": "195-199"
    },
    {
        "id": "authors:06pjn-hnt60",
        "collection": "authors",
        "collection_id": "06pjn-hnt60",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20210129-070700034",
        "type": "article",
        "title": "Low-cost, scalable, and automated fluid sampling for fluidics applications",
        "author": [
            {
                "family_name": "Booeshaghi",
                "given_name": "A. Sina",
                "orcid": "0000-0002-6442-4502",
                "clpid": "Booeshaghi-A-Sina"
            },
            {
                "family_name": "Kil",
                "given_name": "Yeokyoung",
                "orcid": "0000-0002-1235-7379",
                "clpid": "Kil-Yeokyoung"
            },
            {
                "family_name": "Min",
                "given_name": "Kyung Hoi",
                "orcid": "0000-0003-0894-4017",
                "clpid": "Min-Kyung-Hoi"
            },
            {
                "family_name": "Gehring",
                "given_name": "Jase",
                "orcid": "0000-0002-3894-9495",
                "clpid": "Gehring-Jase"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "We present colosseum, a low-cost, modular, and automated fluid sampling device for scalable fluidic applications. The colosseum fraction collector uses a single motor, can be built for less than $100 using off-the-shelf and 3D-printed components, and can be assembled in less than an hour. Build Instructions and source files are available at https://doi.org/10.5281/zenodo.4677604.",
        "doi": "10.1016/j.ohx.2021.e00201",
        "issn": "2468-0672",
        "publisher": "Elsevier",
        "publication": "HardwareX",
        "publication_date": "2021-10",
        "volume": "10",
        "pages": "Art. No. e00201"
    },
    {
        "id": "authors:jp119-47j39",
        "collection": "authors",
        "collection_id": "jp119-47j39",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20200520-084505912",
        "type": "article",
        "title": "Normalization of single-cell RNA-seq counts by log(x+1)* or log(1+x)*",
        "author": [
            {
                "family_name": "Booeshaghi",
                "given_name": "A. Sina",
                "orcid": "0000-0002-6442-4502",
                "clpid": "Booeshaghi-A-Sina"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "Single-cell RNA-seq technologies have been successfully employed over the past decade to generate many high resolution cell atlases. These have proved invaluable in recent efforts aimed at understanding the cell type specificity of host genes involved in SARS-CoV-2 infections. While single-cell atlases are based on well-sampled highly-expressed genes, many of the genes of interest for understanding SARS-CoV-2 can be expressed at very low levels. Common assumptions underlying standard single-cell analyses don't hold when examining low-expressed genes, with the result that standard workflows can produce misleading results.",
        "doi": "10.1093/bioinformatics/btab085",
        "pmcid": "PMC7989636",
        "issn": "1367-4803",
        "publisher": "Oxford University Press",
        "publication": "Bioinformatics",
        "publication_date": "2021-08-01",
        "series_number": "15",
        "volume": "37",
        "issue": "15",
        "pages": "2223-2224"
    },
    {
        "id": "authors:05kjw-6t056",
        "collection": "authors",
        "collection_id": "05kjw-6t056",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20201119-132151980",
        "type": "article",
        "title": "Massively scaled-up testing for SARS-CoV-2 RNA via next-generation sequencing of pooled and barcoded nasal and saliva samples",
        "author": [
            {
                "family_name": "Bloom",
                "given_name": "Joshua S.",
                "orcid": "0000-0002-7241-1648",
                "clpid": "Bloom-Joshua-S"
            },
            {
                "family_name": "Sathe",
                "given_name": "Laila",
                "orcid": "0000-0003-1016-3295",
                "clpid": "Sathe-Laila"
            },
            {
                "family_name": "Munugala",
                "given_name": "Chetan",
                "clpid": "Munugala-Chetan"
            },
            {
                "family_name": "Jones",
                "given_name": "Eric M.",
                "clpid": "Jones-Eric-M"
            },
            {
                "family_name": "Gasperini",
                "given_name": "Molly",
                "orcid": "0000-0003-4559-8432",
                "clpid": "Gasperini-Molly"
            },
            {
                "family_name": "Lubock",
                "given_name": "Nathan B.",
                "orcid": "0000-0001-8064-2465",
                "clpid": "Lubock-Nathan-B"
            },
            {
                "family_name": "Yarza",
                "given_name": "Fauna",
                "orcid": "0000-0002-2512-6182",
                "clpid": "Yarza-Fauna"
            },
            {
                "family_name": "Thompson",
                "given_name": "Erin M.",
                "orcid": "0000-0002-6085-3051",
                "clpid": "Thompson-Erin-M"
            },
            {
                "family_name": "Kovary",
                "given_name": "Kyle M.",
                "orcid": "0000-0002-7616-2968",
                "clpid": "Kovary-Kyle-M"
            },
            {
                "family_name": "Park",
                "given_name": "Jimin",
                "clpid": "Park-Jimin"
            },
            {
                "family_name": "Marquette",
                "given_name": "Dawn",
                "orcid": "0000-0002-3964-7683",
                "clpid": "Marquette-Dawn"
            },
            {
                "family_name": "Kay",
                "given_name": "Stephania",
                "clpid": "Kay-Stephania"
            },
            {
                "family_name": "Lucas",
                "given_name": "Mark",
                "clpid": "Lucas-Mark"
            },
            {
                "family_name": "Love",
                "given_name": "TreQuan",
                "clpid": "Love-TreQuan"
            },
            {
                "family_name": "Booeshaghi",
                "given_name": "A. Sina",
                "orcid": "0000-0002-6442-4502",
                "clpid": "Booeshaghi-A-Sina"
            },
            {
                "family_name": "Brandenberg",
                "given_name": "Oliver F.",
                "orcid": "0000-0001-5662-1234",
                "clpid": "Brandenberg-Oliver-F"
            },
            {
                "family_name": "Guo",
                "given_name": "Longhua",
                "orcid": "0000-0001-9690-9750",
                "clpid": "Guo-Longhua"
            },
            {
                "family_name": "Boocock",
                "given_name": "James",
                "orcid": "0000-0003-0323-8818",
                "clpid": "Boocock-James"
            },
            {
                "family_name": "Hochman",
                "given_name": "Myles",
                "orcid": "0000-0001-5172-6395",
                "clpid": "Hochman-Myles"
            },
            {
                "family_name": "Simpkins",
                "given_name": "Scott W.",
                "orcid": "0000-0002-5997-2838",
                "clpid": "Simpkins-Scott-W"
            },
            {
                "family_name": "Lin",
                "given_name": "Isabella",
                "orcid": "0000-0002-7102-6879",
                "clpid": "Lin-Isabella"
            },
            {
                "family_name": "LaPierre",
                "given_name": "Nathan",
                "orcid": "0000-0003-2394-8868",
                "clpid": "LaPierre-Nathan"
            },
            {
                "family_name": "Hong",
                "given_name": "Duke",
                "clpid": "Hong-Duke"
            },
            {
                "family_name": "Zhang",
                "given_name": "Yi",
                "clpid": "Zhang-Yi"
            },
            {
                "family_name": "Oland",
                "given_name": "Gabriel",
                "orcid": "0000-0002-6941-3060",
                "clpid": "Oland-Gabriel"
            },
            {
                "family_name": "Choe",
                "given_name": "Bianca Judy",
                "clpid": "Choe-Bianca-Judy"
            },
            {
                "family_name": "Chandrasekaran",
                "given_name": "Sukantha",
                "orcid": "0000-0002-6232-5535",
                "clpid": "Chandrasekaran-Sukantha"
            },
            {
                "family_name": "Hilt",
                "given_name": "Evann E.",
                "clpid": "Hilt-Evann-E"
            },
            {
                "family_name": "Butte",
                "given_name": "Manish J.",
                "orcid": "0000-0002-4490-5595",
                "clpid": "Butte-Manish-J"
            },
            {
                "family_name": "Damoiseaux",
                "given_name": "Robert",
                "orcid": "0000-0002-7611-7534",
                "clpid": "Damoiseaux-Robert"
            },
            {
                "family_name": "Kravit",
                "given_name": "Clifford",
                "orcid": "0000-0002-0624-5514",
                "clpid": "Kravit-Clifford"
            },
            {
                "family_name": "Cooper",
                "given_name": "Aaron R.",
                "orcid": "0000-0003-4588-2513",
                "clpid": "Cooper-Aaron-R"
            },
            {
                "family_name": "Yin",
                "given_name": "Yi",
                "orcid": "0000-0003-0963-2672",
                "clpid": "Yin-Yi"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "family_name": "Garner",
                "given_name": "Omai B.",
                "orcid": "0000-0002-7366-2692",
                "clpid": "Garner-Omai-B"
            },
            {
                "family_name": "Flint",
                "given_name": "Jonathan",
                "orcid": "0000-0002-9427-4429",
                "clpid": "Flint-Jonathan"
            },
            {
                "family_name": "Eskin",
                "given_name": "Eleazar",
                "orcid": "0000-0003-1149-4758",
                "clpid": "Eskin-Eleazar"
            },
            {
                "family_name": "Luo",
                "given_name": "Chongyuan",
                "orcid": "0000-0002-8541-0695",
                "clpid": "Luo-Chongyuan"
            },
            {
                "family_name": "Kosuri",
                "given_name": "Sriram",
                "orcid": "0000-0002-4661-0600",
                "clpid": "Kosuri-Sriram"
            },
            {
                "family_name": "Kruglyak",
                "given_name": "Leonid",
                "orcid": "0000-0002-8065-3057",
                "clpid": "Kruglyak-Leonid"
            },
            {
                "family_name": "Arboleda",
                "given_name": "Valerie A.",
                "orcid": "0000-0002-9687-9122",
                "clpid": "Arboleda-Valerie-A"
            }
        ],
        "abstract": "Frequent and widespread testing of members of the population who are asymptomatic for severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is essential for the mitigation of the transmission of the virus. Despite the recent increases in testing capacity, tests based on quantitative polymerase chain reaction (qPCR) assays cannot be easily deployed at the scale required for population-wide screening. Here, we show that next-generation sequencing of pooled samples tagged with sample-specific molecular barcodes enables the testing of thousands of nasal or saliva samples for SARS-CoV-2 RNA in a single run without the need for RNA extraction. The assay, which we named SwabSeq, incorporates a synthetic RNA standard that facilitates end-point quantification and the calling of true negatives, and that reduces the requirements for automation, purification and sample-to-sample normalization. We used SwabSeq to perform 80,000 tests, with an analytical sensitivity and specificity comparable to or better than traditional qPCR tests, in less than two months with turnaround times of less than 24\u2009h. SwabSeq could be rapidly adapted for the detection of other pathogens.",
        "doi": "10.1038/s41551-021-00754-5",
        "pmcid": "PMC7480060",
        "issn": "2157-846X",
        "publisher": "Nature Publishing Group",
        "publication": "Nature Biomedical Engineering",
        "publication_date": "2021-07",
        "series_number": "7",
        "volume": "5",
        "issue": "7",
        "pages": "657-665"
    },
    {
        "id": "authors:363j8-nw138",
        "collection": "authors",
        "collection_id": "363j8-nw138",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20210405-142728694",
        "type": "article",
        "title": "Modular, efficient and constant-memory single-cell RNA-seq preprocessing",
        "author": [
            {
                "family_name": "Melsted",
                "given_name": "P\u00e1ll",
                "orcid": "0000-0002-8418-6724",
                "clpid": "Melsted-P\u00e1ll"
            },
            {
                "family_name": "Booeshaghi",
                "given_name": "A. Sina",
                "orcid": "0000-0002-6442-4502",
                "clpid": "Booeshaghi-A-Sina"
            },
            {
                "family_name": "Liu",
                "given_name": "Lauren",
                "clpid": "Liu-Lauren"
            },
            {
                "family_name": "Gao",
                "given_name": "Fan",
                "clpid": "Gao-Fan"
            },
            {
                "family_name": "Lu",
                "given_name": "Lambda",
                "orcid": "0000-0002-7092-9427",
                "clpid": "Lu-Lambda"
            },
            {
                "family_name": "Min",
                "given_name": "Kyung Hoi",
                "orcid": "0000-0003-0894-4017",
                "clpid": "Min-Kyung-Hoi"
            },
            {
                "family_name": "da Veiga Beltrame",
                "given_name": "Eduardo",
                "orcid": "0000-0002-1529-9207",
                "clpid": "da-Veiga-Beltrame-Eduardo"
            },
            {
                "family_name": "Hjorleifsson",
                "given_name": "Kristj\u00e1n Eldj\u00e1rn",
                "orcid": "0000-0002-7851-1818",
                "clpid": "Hjorleifsson-Kristj\u00e1n-E"
            },
            {
                "family_name": "Gehring",
                "given_name": "Jase",
                "orcid": "0000-0002-3894-9495",
                "clpid": "Gehring-Jase"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "We describe a workflow for preprocessing of single-cell RNA-sequencing data that balances efficiency and accuracy. Our workflow is based on the kallisto and bustools programs, and is near optimal in speed with a constant memory requirement providing scalability for arbitrarily large datasets. The workflow is modular, and we demonstrate its flexibility by showing how it can be used for RNA velocity analyses.",
        "doi": "10.1038/s41587-021-00870-2",
        "issn": "1087-0156",
        "publisher": "Nature Publishing Group",
        "publication": "Nature Biotechnology",
        "publication_date": "2021-07",
        "series_number": "7",
        "volume": "39",
        "issue": "7",
        "pages": "813-818"
    },
    {
        "id": "authors:anq5b-hzp82",
        "collection": "authors",
        "collection_id": "anq5b-hzp82",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20200707-114817234",
        "type": "article",
        "title": "BUTTERFLY: addressing the pooled amplification paradox with unique molecular identifiers in single-cell RNA-seq",
        "author": [
            {
                "family_name": "Gustafsson",
                "given_name": "Johan",
                "orcid": "0000-0001-5072-2659",
                "clpid": "Gustafsson-Johan"
            },
            {
                "family_name": "Robinson",
                "given_name": "Jonathan",
                "orcid": "0000-0001-8567-5960",
                "clpid": "Robinson-Jonathan"
            },
            {
                "family_name": "Nielsen",
                "given_name": "Jens",
                "orcid": "0000-0002-9955-6003",
                "clpid": "Nielsen-Jens"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "The incorporation of unique molecular identifiers (UMIs) in single-cell RNA-seq assays makes possible the identification of duplicated molecules, thereby facilitating the counting of distinct molecules from sequenced reads. However, we show that the na\u00efve removal of duplicates can lead to a bias due to a \"pooled amplification paradox,\" and we propose an improved quantification method based on unseen species modeling. Our correction called BUTTERFLY uses a zero truncated negative binomial estimator implemented in the kallisto bustools workflow. We demonstrate its efficacy across cell types and genes and show that in some cases it can invert the relative abundance of genes.",
        "doi": "10.1186/s13059-021-02386-z",
        "pmcid": "PMC8188791",
        "issn": "1474-760X",
        "publisher": "BioMed Central",
        "publication": "Genome Biology",
        "publication_date": "2021-06-08",
        "volume": "22",
        "pages": "Art. No. 174"
    },
    {
        "id": "authors:t6026-vtf40",
        "collection": "authors",
        "collection_id": "t6026-vtf40",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20210503-100056268",
        "type": "article",
        "title": "Analysis of Length Biases in Single-Cell RNA Sequencing of Unspliced mRNA by Markov Modeling",
        "author": [
            {
                "family_name": "Gorin",
                "given_name": "Gennady",
                "orcid": "0000-0001-6097-2029",
                "clpid": "Gorin-Gennady"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "Recent experimental advances in single-cell RNA sequencing (scRNA-seq) have enabled the quantification of transcriptomes with single-molecule resolution. However, thus far, the stochastic modeling of transcription has been separate from the discussion of the statistics of the sequencing process, leading to simplifications that may obfuscate transcriptional dynamics, and technical artifacts in the assays. For example, imputation, normalization, and smoothing, used to correct for stochastic sequencing phenomena, make experimental molecule count data incompatible with a discrete representation, thus rendering the data uninterpretable in the context of conventional Chemical Master Equation (CME) models. Models of gene expression - such as the negative binomial count model - are used with limited physical justification, whereas models for multimodal data are under-explored. Conversely, more detailed CME descriptions of gene expression do not directly address the complexities of the sequencing process. We demonstrate that modeling both phenomena reveals a pervasive gene length-based effect in the detection of unspliced mRNA: long genes are substantially more likely to have higher average unspliced mRNA expression. To explain this effect, we build a stochastic model that accounts for physiological and experimental events, and jointly infer hundreds of gene-specific as well as transcriptome-wide parameters. Specifically, we extend a joint model of mRNA processing described by Singh and Bokes (Biophys. J., 2012) to incorporate downstream Poisson sampling, representing cDNA library construction and sequencing. The explicit inclusion of sampling yields mechanistically interpretable results for the gene expression parameters, and suggests extensions to more complex models.",
        "doi": "10.1016/j.bpj.2020.11.706",
        "issn": "0006-3495",
        "publisher": "Biophysical Society",
        "publication": "Biophysical Journal",
        "publication_date": "2021-02-12",
        "series_number": "3",
        "volume": "120",
        "issue": "3",
        "pages": "81A"
    },
    {
        "id": "authors:8h643-s8r26",
        "collection": "authors",
        "collection_id": "8h643-s8r26",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20210503-102227319",
        "type": "article",
        "title": "Learning the Dynamics of Bursty Transcription and Splicing using Ultra-Fast Parameter Inference and New Analytical Solutions of the Chemical Master Equation",
        "author": [
            {
                "family_name": "Vastola",
                "given_name": "John J.",
                "clpid": "Vastola-John-J"
            },
            {
                "family_name": "Gorin",
                "given_name": "Gennady",
                "orcid": "0000-0001-6097-2029",
                "clpid": "Gorin-Gennady"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "family_name": "Holmes",
                "given_name": "William R.",
                "clpid": "Holmes-William-R"
            }
        ],
        "abstract": "Single cell RNA counts data is increasingly available, and can in principle be used to extract mechanistic insight about transcription and splicing dynamics. In order to infer numbers related to processes of biophysical interest---for example, splicing rates, RNA production rates, RNA degradation rates, and the number of splicing steps involved in processing some particular kind of RNA---it is necessary to compare the predictions of quantitative models with counts data. In practice, this involves generating model predictions for an enormous number of parameter sets, and using some measure of goodness of fit to determine reasonable parameter ranges; because this procedure tends to be extremely computationally expensive, one can typically fit only very simple models involving a small state space and small number of parameters. We report on a new approach to fitting the dynamics of bursty transcription and splicing, which uses newly derived analytical solutions to the chemical master equation to greatly speed up parameter inference. The associated speedup, which we have found on simulated counts data to be many orders of magnitude in some cases, comes from not using stochastic simulations or numerical approaches like finite state projection, but the aforementioned closed-form mathematical formulas. Our approach applies to models of splicing involving arbitrarily many splicing steps, introns that can be removed in an arbitrary order, and arbitrarily many downstream alternatively spliced variants. Moreover, it scales extremely well as one's splicing model gets increasingly complicated (e.g. more splicing steps, more alternative splicing branches). We comment on some of the issues associated with using these algorithms to learn parameters from real counts data, including identifiability problems.",
        "doi": "10.1016/j.bpj.2020.11.1018",
        "issn": "0006-3495",
        "publisher": "Biophysical Society",
        "publication": "Biophysical Journal",
        "publication_date": "2021-02-12",
        "series_number": "3",
        "volume": "120",
        "issue": "3",
        "pages": "135A"
    },
    {
        "id": "authors:r5anw-4eh31",
        "collection": "authors",
        "collection_id": "r5anw-4eh31",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20200601-101849395",
        "type": "article",
        "title": "Reliable and accurate diagnostics from highly multiplexed sequencing assays",
        "author": [
            {
                "family_name": "Booeshaghi",
                "given_name": "A. Sina",
                "orcid": "0000-0002-6442-4502",
                "clpid": "Booeshaghi-A-Sina"
            },
            {
                "family_name": "Lubock",
                "given_name": "Nathan B.",
                "orcid": "0000-0001-8064-2465",
                "clpid": "Lubock-Nathan-B"
            },
            {
                "family_name": "Cooper",
                "given_name": "Aaron R.",
                "orcid": "0000-0003-4588-2513",
                "clpid": "Cooper-Aaron-R"
            },
            {
                "family_name": "Simpkins",
                "given_name": "Scott W.",
                "orcid": "0000-0002-5997-2838",
                "clpid": "Simpkins-Scott-W"
            },
            {
                "family_name": "Bloom",
                "given_name": "Joshua S.",
                "orcid": "0000-0002-7241-1648",
                "clpid": "Bloom-Joshua-S"
            },
            {
                "family_name": "Gehring",
                "given_name": "Jase",
                "orcid": "0000-0002-3894-9495",
                "clpid": "Gehring-Jase"
            },
            {
                "family_name": "Luebbert",
                "given_name": "Laura",
                "orcid": "0000-0003-1379-2927",
                "clpid": "Luebbert-Laura"
            },
            {
                "family_name": "Kosuri",
                "given_name": "Sriram",
                "orcid": "0000-0002-4661-0600",
                "clpid": "Kosuri-Sriram"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "Scalable, inexpensive, and secure testing for SARS-CoV-2 infection is crucial for control of the novel coronavirus pandemic. Recently developed highly multiplexed sequencing assays (HMSAs) that rely on high-throughput sequencing can, in principle, meet these demands, and present promising alternatives to currently used RT-qPCR-based tests. However, reliable analysis, interpretation, and clinical use of HMSAs requires overcoming several computational, statistical and engineering challenges. Using recently acquired experimental data, we present and validate a computational workflow based on kallisto and bustools, that utilizes robust statistical methods and fast, memory efficient algorithms, to quickly, accurately and reliably process high-throughput sequencing data. We show that our workflow is effective at processing data from all recently proposed SARS-CoV-2 sequencing based diagnostic tests, and is generally applicable to any diagnostic HMSA.",
        "doi": "10.1038/s41598-020-78942-7",
        "pmcid": "PMC7730459",
        "issn": "2045-2322",
        "publisher": "Nature Publishing Group",
        "publication": "Scientific Reports",
        "publication_date": "2020-12-10",
        "volume": "10",
        "pages": "Art. No. 21759"
    },
    {
        "id": "authors:fvv0w-xy358",
        "collection": "authors",
        "collection_id": "fvv0w-xy358",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20190821-092511308",
        "type": "article",
        "title": "A curated database reveals trends in single cell transcriptomics",
        "author": [
            {
                "family_name": "Svensson",
                "given_name": "Valentine",
                "orcid": "0000-0002-9217-2330",
                "clpid": "Svensson-Valentine"
            },
            {
                "family_name": "da Veiga Beltrame",
                "given_name": "Eduardo",
                "orcid": "0000-0002-1529-9207",
                "clpid": "da-Veiga-Beltrame-Eduardo"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "The more than 1000 single-cell transcriptomics studies that have been published to date constitute a valuable and vast resource for biological discovery. While various 'atlas' projects have collated some of the associated datasets, most questions related to specific tissue types, species or other attributes of studies require identifying papers through manual and challenging literature search. To facilitate discovery with published single-cell transcriptomics data, we have assembled a near exhaustive, manually curated database of single-cell transcriptomics studies with key information: descriptions of the type of data and technologies used, along with descriptors of the biological systems studied. Additionally, the database contains summarized information about analysis in the papers, allowing for analysis of trends in the field. As an example, we show that the number of cell types identified in scRNA-seq studies is proportional to the number of cells analysed.",
        "doi": "10.1093/database/baaa073",
        "pmcid": "PMC7698659",
        "issn": "1758-0463",
        "publisher": "Oxford University Press",
        "publication": "Database: The Journal of Biological Databases and Curation",
        "publication_date": "2020-11-28",
        "volume": "2020",
        "pages": "Art. No. baaa073"
    },
    {
        "id": "authors:bkjds-vsh60",
        "collection": "authors",
        "collection_id": "bkjds-vsh60",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20210309-074448590",
        "type": "article",
        "title": "A faster implementation of association mapping from k-mers",
        "author": [
            {
                "family_name": "Mehrab",
                "given_name": "Zakaria",
                "clpid": "Mehrab-Zakaria"
            },
            {
                "family_name": "Mobin",
                "given_name": "Jaiaid",
                "clpid": "Mobin-Jaiaid"
            },
            {
                "family_name": "Tahmid",
                "given_name": "Ibrahim Asadullah",
                "clpid": "Tahmid-Ibrahim-Asadullah"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "family_name": "Rahman",
                "given_name": "Atif",
                "orcid": "0000-0003-1805-3971",
                "clpid": "Rahman-Atif"
            }
        ],
        "abstract": "Association mapping is the process of linking phenotypes with genotypes. In genome wide association studies (GWAS), individuals are first genotyped using microarrays or by aligning sequenced reads to reference genomes. However, both these approaches rely on reference genomes which limits their application to organisms with no or incomplete reference genomes. To address this, reference free association mapping methods have been developed. Here we present the protocol of an alignment free method for association studies which is based on counting k-mers in sequenced reads, testing for associations between k-mers and the phenotype of interest, and local assembly of the k-mers of statistical significance. The method can map associations of categorical phenotypes to sequence and structural variations without requiring prior sequencing of reference genomes.",
        "doi": "10.21769/bioprotoc.3815",
        "issn": "2331-8325",
        "publisher": "Bio-Protocol",
        "publication": "Bio-protocol",
        "publication_date": "2020-11-05",
        "series_number": "21",
        "volume": "10",
        "issue": "21",
        "pages": "Art. No. e3815"
    },
    {
        "id": "authors:ampr3-ja254",
        "collection": "authors",
        "collection_id": "ampr3-ja254",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20200909-153753998",
        "type": "article",
        "title": "Special function methods for bursty models of transcription",
        "author": [
            {
                "family_name": "Gorin",
                "given_name": "Gennady",
                "orcid": "0000-0001-6097-2029",
                "clpid": "Gorin-G"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "We explore a Markov model used in the analysis of gene expression, involving the bursty production of pre-mRNA, its conversion to mature mRNA, and its consequent degradation. We demonstrate that the integration used to compute the solution of the stochastic system can be approximated by the evaluation of special functions. Furthermore, the form of the special function solution generalizes to a broader class of burst distributions. In light of the broader goal of biophysical parameter inference from transcriptomics data, we apply the method to simulated data, demonstrating effective control of precision and runtime. Finally, we propose and validate a non-Bayesian approach for parameter estimation based on the characteristic function of the target joint distribution of pre-mRNA and mRNA.",
        "doi": "10.1103/physreve.102.022409",
        "issn": "2470-0045",
        "publisher": "American Physical Society",
        "publication": "Physical Review E",
        "publication_date": "2020-08",
        "series_number": "2",
        "volume": "102",
        "issue": "2",
        "pages": "Art. No. 022409"
    },
    {
        "id": "authors:nn6dw-xb480",
        "collection": "authors",
        "collection_id": "nn6dw-xb480",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20191125-141648000",
        "type": "article",
        "title": "Odd-paired is a pioneer-like factor that coordinates with Zelda to control gene expression in embryos",
        "author": [
            {
                "family_name": "Koromila",
                "given_name": "Theodora",
                "orcid": "0000-0001-5504-1369",
                "clpid": "Koromila-Theodora"
            },
            {
                "family_name": "Gao",
                "given_name": "Fan",
                "clpid": "Gao-Fan"
            },
            {
                "family_name": "Iwasaki",
                "given_name": "Yasuno",
                "clpid": "Iwasaki-Yasuno"
            },
            {
                "family_name": "He",
                "given_name": "Peng",
                "clpid": "He-Peng"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "family_name": "Gergen",
                "given_name": "J. Peter",
                "clpid": "Gergen-J-Peter"
            },
            {
                "family_name": "Stathopoulos",
                "given_name": "Angelike",
                "orcid": "0000-0001-6597-2036",
                "clpid": "Stathopoulos-A"
            }
        ],
        "abstract": "Pioneer factors such as Zelda (Zld) help initiate zygotic transcription in Drosophila early embryos, but whether other factors support this dynamic process is unclear. Odd-paired (Opa), a zinc-finger transcription factor expressed at cellularization, controls the transition of genes from pair-rule to segmental patterns along the anterior-posterior axis. Finding that Opa also regulates expression through enhancer sog_Distal along the dorso-ventral axis, we hypothesized Opa's role is more general. Chromatin-immunoprecipitation (ChIP-seq) confirmed its in vivo binding to sog_Distal but also identified widespread binding throughout the genome, comparable to Zld. Furthermore, chromatin assays (ATAC-seq) demonstrate that Opa, like Zld, influences chromatin accessibility genome-wide at cellularization, suggesting both are pioneer factors with common as well as distinct targets. Lastly, embryos lacking opa exhibit widespread, late patterning defects spanning both axes. Collectively, these data suggest Opa is a general timing factor and likely late-acting pioneer factor that drives a secondary wave of zygotic gene expression.",
        "doi": "10.7554/eLife.59610",
        "pmcid": "PMC7417190",
        "issn": "2050-084X",
        "publisher": "eLife Sciences Publications",
        "publication": "eLife",
        "publication_date": "2020-07-23",
        "volume": "9",
        "pages": "Art. No. e59610"
    },
    {
        "id": "authors:zc0e4-j1959",
        "collection": "authors",
        "collection_id": "zc0e4-j1959",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20200602-124021279",
        "type": "article",
        "title": "RefShannon: A genome-guided transcriptome assembler using sparse flow decomposition",
        "author": [
            {
                "family_name": "Mao",
                "given_name": "Shunfu",
                "orcid": "0000-0002-8203-0507",
                "clpid": "Mao-Shunfu"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "family_name": "Tse",
                "given_name": "David",
                "clpid": "Tse-David-N"
            },
            {
                "family_name": "Kannan",
                "given_name": "Sreeram",
                "clpid": "Kannan-S"
            }
        ],
        "abstract": "High throughput sequencing of RNA (RNA-Seq) has become a staple in modern molecular biology, with applications not only in quantifying gene expression but also in isoform-level analysis of the RNA transcripts. To enable such an isoform-level analysis, a transcriptome assembly algorithm is utilized to stitch together the observed short reads into the corresponding transcripts. This task is complicated due to the complexity of alternative splicing - a mechanism by which the same gene may generate multiple distinct RNA transcripts. We develop a novel genome-guided transcriptome assembler, RefShannon, that exploits the varying abundances of the different transcripts, in enabling an accurate reconstruction of the transcripts. Our evaluation shows RefShannon is able to improve sensitivity effectively (up to 22%) at a given specificity in comparison with other state-of-the-art assemblers. RefShannon is written in Python and is available from Github (https://github.com/shunfumao/RefShannon).",
        "doi": "10.1371/journal.pone.0232946",
        "pmcid": "PMC7266320",
        "issn": "1932-6203",
        "publisher": "Public Library of Science",
        "publication": "PLoS ONE",
        "publication_date": "2020-06-02",
        "series_number": "6",
        "volume": "15",
        "issue": "6",
        "pages": "Art. No. e0232946"
    },
    {
        "id": "authors:2pyfk-v8764",
        "collection": "authors",
        "collection_id": "2pyfk-v8764",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20190816-135915873",
        "type": "article",
        "title": "Interpretable factor models of single-cell RNA-seq via variational autoencoders",
        "author": [
            {
                "family_name": "Svensson",
                "given_name": "Valentine",
                "orcid": "0000-0002-9217-2330",
                "clpid": "Svensson-Valentine"
            },
            {
                "family_name": "Gayoso",
                "given_name": "Adam",
                "orcid": "0000-0001-9537-0845",
                "clpid": "Gayoso-Adam"
            },
            {
                "family_name": "Yosef",
                "given_name": "Nir",
                "clpid": "Yosef-Nir"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "Motivation: Single-cell RNA-seq makes possible the investigation of variability in gene expression among cells, and dependence of variation on cell type. Statistical inference methods for such analyses must be scalable, and ideally interpretable. \n\nResults: We present an approach based on a modification of a recently published highly scalable variational autoencoder framework that provides interpretability without sacrificing much accuracy. We demonstrate that our approach enables identification of gene programs in massive datasets. Our strategy, namely the learning of factor models with the auto-encoding variational Bayes framework, is not domain specific and may be useful for other applications. \n\nAvailability and implementation: The factor model is available in the scVI package hosted at https://github.com/YosefLab/scVI/.",
        "doi": "10.1093/bioinformatics/btaa169",
        "pmcid": "PMC7267837",
        "issn": "1367-4803",
        "publisher": "Oxford University Press",
        "publication": "Bioinformatics",
        "publication_date": "2020-06",
        "series_number": "11",
        "volume": "36",
        "issue": "11",
        "pages": "3418-3421"
    },
    {
        "id": "authors:akk2e-ngb36",
        "collection": "authors",
        "collection_id": "akk2e-ngb36",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20190607-122759859",
        "type": "article",
        "title": "RNA velocity and protein acceleration from single-cell multiomics experiments",
        "author": [
            {
                "family_name": "Gorin",
                "given_name": "Gennady",
                "orcid": "0000-0001-6097-2029",
                "clpid": "Gorin-G"
            },
            {
                "family_name": "Svensson",
                "given_name": "Valentine",
                "orcid": "0000-0002-9217-2330",
                "clpid": "Svensson-V"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "The simultaneous quantification of protein and RNA makes possible the inference of past, present, and future cell states from single experimental snapshots. To enable such temporal analysis from multimodal single-cell experiments, we introduce an extension of the RNA velocity method that leverages estimates of unprocessed transcript and protein abundances to extrapolate cell states. We apply the model to six datasets and demonstrate consistency among cell landscapes and phase portraits. The analysis software is available as the protaccel Python package.",
        "doi": "10.1186/s13059-020-1945-3",
        "pmcid": "PMC7029606",
        "issn": "1465-6906",
        "publisher": "BioMed Central",
        "publication": "Genome Biology",
        "publication_date": "2020-02-18",
        "volume": "21",
        "pages": "Art. No. 39"
    },
    {
        "id": "authors:m07dw-3jq86",
        "collection": "authors",
        "collection_id": "m07dw-3jq86",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20181030-145533155",
        "type": "article",
        "title": "Highly multiplexed single-cell RNA-seq by DNA oligonucleotide tagging of cellular proteins",
        "author": [
            {
                "family_name": "Gehring",
                "given_name": "Jase",
                "orcid": "0000-0002-3894-9495",
                "clpid": "Gehring-J"
            },
            {
                "family_name": "Park",
                "given_name": "Jong Hwee",
                "clpid": "Park-Jong-Hwee"
            },
            {
                "family_name": "Chen",
                "given_name": "Sisi",
                "orcid": "0000-0001-9448-9713",
                "clpid": "Chen-Sisi"
            },
            {
                "family_name": "Thomson",
                "given_name": "Matthew",
                "orcid": "0000-0003-1021-1234",
                "clpid": "Thomson-M-W"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "We describe a universal sample multiplexing method for single-cell RNA sequencing in which fixed cells are chemically labeled by attaching identifying DNA oligonucleotides to cellular proteins. Analysis of a 96-plex perturbation experiment revealed changes in cell population structure and transcriptional states that cannot be discerned from bulk measurements, establishing an efficient method for surveying cell populations from large experiments or clinical samples with the depth and resolution of single-cell RNA sequencing.",
        "doi": "10.1038/s41587-019-0372-z",
        "issn": "1087-0156",
        "publisher": "Nature Publishing Group",
        "publication": "Nature Biotechnology",
        "publication_date": "2020-01",
        "series_number": "1",
        "volume": "38",
        "issue": "1",
        "pages": "35-38"
    },
    {
        "id": "authors:x39ya-h3a68",
        "collection": "authors",
        "collection_id": "x39ya-h3a68",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20200106-081503232",
        "type": "article",
        "title": "The BUS Format for Single-Cell RNA-Seq Processing and Analysis",
        "author": [
            {
                "family_name": "Gao",
                "given_name": "Fan",
                "clpid": "Gao-Fan"
            },
            {
                "family_name": "da Veiga Beltrame",
                "given_name": "Eduardo",
                "orcid": "0000-0002-1529-9207",
                "clpid": "da-Veiga-Beltrame-E"
            },
            {
                "family_name": "Gehring",
                "given_name": "Jase A.",
                "clpid": "Gehring-J-A"
            },
            {
                "family_name": "Hjoerleifsson",
                "given_name": "Kristin E. Edljarn",
                "clpid": "Hjoerleifsson-K-E-E"
            },
            {
                "family_name": "Lu",
                "given_name": "Lambda",
                "orcid": "0000-0002-7092-9427",
                "clpid": "Lu-Lambda"
            },
            {
                "family_name": "Melsted",
                "given_name": "P\u00e1ll",
                "orcid": "0000-0002-8418-6724",
                "clpid": "Melsted-P"
            },
            {
                "family_name": "Ntranos",
                "given_name": "Vasilis",
                "orcid": "0000-0002-2477-0670",
                "clpid": "Ntranos-V"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "family_name": "Svensson",
                "given_name": "Valentine",
                "orcid": "0000-0002-9217-2330",
                "clpid": "Svensson-V"
            }
        ],
        "abstract": "The Barcode-UMI-Set format (BUS) is a recently developed format for representing pseudoalignments of reads from single-cell RNA-seq experiments. The format can be used with most single-cell RNA-seq technologies, can be generated efficiently, and allows for development of modular and robust workflows for processing and analysis of single-cell RNA-seq reads. To demonstrate the utility of BUS, we processed 381,992,071 single-cell RNA-Seq reads from a 1:1 mixture of fresh frozen human cells (HEK293T) and mouse cells (NIH3T3) produced with 10x technology and hosted on the 10x Genomics website. The generation of BUS format using a new command in the kallisto program took 984 seconds for this data (in comparison with 55,745 seconds with the 10x Genomics CellRanger software). I will present results showing that this workflow not only produces comparable results to the existing standard workflow, but is flexible and useful for many other applications.",
        "pmcid": "PMC6938108",
        "issn": "1524-0215",
        "publisher": "Association of Biomolecular Resource Facilities",
        "publication": "Journal of Biomolecular Techniques",
        "publication_date": "2019-12",
        "series_number": "S1",
        "volume": "30",
        "issue": "S1",
        "pages": "S62"
    },
    {
        "id": "authors:jk166-h3s87",
        "collection": "authors",
        "collection_id": "jk166-h3s87",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20181128-093526289",
        "type": "article",
        "title": "Barcode, UMI, Set format and BUStools",
        "author": [
            {
                "family_name": "Melsted",
                "given_name": "P\u00e1ll",
                "orcid": "0000-0002-8418-6724",
                "clpid": "Melsted-P"
            },
            {
                "family_name": "Ntranos",
                "given_name": "Vasilis",
                "orcid": "0000-0002-2477-0670",
                "clpid": "Ntranos-V"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "We introduce the Barcode-UMI-Set format (BUS) for representing pseudoalignments of reads from single-cell RNA-seq experiments. The format can be used with all single-cell RNA-seq technologies, and we show that BUS files can be efficiently generated. BUStools is a suite of tools for working with BUS files and facilitates rapid quantification and analysis of single-cell RNA-seq data. The BUS format therefore makes possible the development of modular, technology-specific and robust workflows for single-cell RNA-seq analysis.",
        "doi": "10.1093/bioinformatics/btz279",
        "issn": "1367-4803",
        "publisher": "Oxford University Press",
        "publication": "Bioinformatics",
        "publication_date": "2019-11-01",
        "series_number": "21",
        "volume": "35",
        "issue": "21",
        "pages": "4472-4473"
    },
    {
        "id": "authors:e3g77-5qr43",
        "collection": "authors",
        "collection_id": "e3g77-5qr43",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20191017-094121433",
        "type": "article",
        "title": "Multimodal Analysis of Cell Types in a Hypothalamic Node Controlling Social Behavior",
        "author": [
            {
                "family_name": "Kim",
                "given_name": "Dong-Wook",
                "orcid": "0000-0002-5497-5853",
                "clpid": "Kim-Dong-Wook"
            },
            {
                "family_name": "Yao",
                "given_name": "Zizhen",
                "orcid": "0000-0002-9361-5607",
                "clpid": "Yao-Zizhen"
            },
            {
                "family_name": "Graybuck",
                "given_name": "Lucas T.",
                "orcid": "0000-0002-8814-6818",
                "clpid": "Graybuck-Lucas-T"
            },
            {
                "family_name": "Kim",
                "given_name": "Tae Kyung",
                "clpid": "Kim-Tae-Kyung"
            },
            {
                "family_name": "Nguyen",
                "given_name": "Thuc Nghi",
                "clpid": "Nguyen-Thuc-Nghi"
            },
            {
                "family_name": "Smith",
                "given_name": "Kimberly A.",
                "clpid": "Smith-Kimberly-A"
            },
            {
                "family_name": "Fong",
                "given_name": "Olivia",
                "clpid": "Fong-Olivia"
            },
            {
                "family_name": "Yi",
                "given_name": "Lynn",
                "orcid": "0000-0003-4575-0158",
                "clpid": "Yi-Lynn"
            },
            {
                "family_name": "Koulena",
                "given_name": "Noushin",
                "orcid": "0000-0002-9419-5712",
                "clpid": "Koulena-Noushin"
            },
            {
                "family_name": "Pierson",
                "given_name": "Nico",
                "orcid": "0000-0002-2451-0633",
                "clpid": "Pierson-Nico-G"
            },
            {
                "family_name": "Shah",
                "given_name": "Sheel",
                "clpid": "Shah-Sheel"
            },
            {
                "family_name": "Lo",
                "given_name": "Liching",
                "clpid": "Lo-Liching"
            },
            {
                "family_name": "Pool",
                "given_name": "Allan-Hermann",
                "orcid": "0000-0002-0811-9861",
                "clpid": "Pool-Allan-Hermann"
            },
            {
                "family_name": "Oka",
                "given_name": "Yuki",
                "orcid": "0000-0003-2686-0677",
                "clpid": "Oka-Yuki"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "family_name": "Cai",
                "given_name": "Long",
                "orcid": "0000-0002-7154-5361",
                "clpid": "Cai-Long"
            },
            {
                "family_name": "Tasic",
                "given_name": "Bosiljka",
                "orcid": "0000-0002-6861-4506",
                "clpid": "Tasic-Bosiljka"
            },
            {
                "family_name": "Zeng",
                "given_name": "Hongkui",
                "orcid": "0000-0002-0326-5878",
                "clpid": "Zeng-Hongkui"
            },
            {
                "family_name": "Anderson",
                "given_name": "David J.",
                "orcid": "0000-0001-6175-3872",
                "clpid": "Anderson-D-J"
            }
        ],
        "abstract": "The ventrolateral subdivision of the ventromedial hypothalamus (VMHvl) contains \u223c4,000 neurons that project to multiple targets and control innate social behaviors including aggression and mounting. However, the number of cell types in VMHvl and their relationship to connectivity and behavioral function are unknown. We performed single-cell RNA sequencing using two independent platforms\u2014SMART-seq (\u223c4,500 neurons) and 10x (\u223c78,000 neurons)\u2014and investigated correspondence between transcriptomic identity and axonal projections or behavioral activation, respectively. Canonical correlation analysis (CCA) identified 17 transcriptomic types (T-types), including several sexually dimorphic clusters, the majority of which were validated by seqFISH. Immediate early gene analysis identified T-types exhibiting preferential responses to intruder males versus females but only rare examples of behavior-specific activation. Unexpectedly, many VMHvl T-types comprise a mixed population of neurons with different projection target preferences. Overall our analysis revealed that, surprisingly, few VMHvl T-types exhibit a clear correspondence with behavior-specific activation and connectivity.",
        "doi": "10.1016/j.cell.2019.09.020",
        "pmcid": "PMC7534821",
        "issn": "0092-8674",
        "publisher": "Cell Press",
        "publication": "Cell",
        "publication_date": "2019-10-17",
        "series_number": "3",
        "volume": "179",
        "issue": "3",
        "pages": "713-728"
    },
    {
        "id": "authors:gyqah-rvc67",
        "collection": "authors",
        "collection_id": "gyqah-rvc67",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20200206-130336877",
        "type": "article",
        "title": "Investigating the Post-Partum Flare in Rheumatoid Arthritis Using Transcriptome Analysis",
        "author": [
            {
                "family_name": "Wright",
                "given_name": "Matthew",
                "clpid": "Wright-M"
            },
            {
                "family_name": "Goin",
                "given_name": "Dana",
                "clpid": "Goin-D-E"
            },
            {
                "family_name": "Smed",
                "given_name": "Mette",
                "clpid": "Smed-M-K"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "family_name": "Nelson",
                "given_name": "J. Lee",
                "clpid": "Nelson-J-L"
            },
            {
                "family_name": "Jewell",
                "given_name": "Nicholas",
                "clpid": "Jewell-N-P"
            },
            {
                "family_name": "Olsen",
                "given_name": "J\u00f8rn",
                "clpid": "Olsen-J"
            },
            {
                "family_name": "Hetland",
                "given_name": "Merete Lund",
                "clpid": "Hetland-M-L"
            },
            {
                "family_name": "Zoffmann",
                "given_name": "Vibeke",
                "clpid": "Zoffmann-V"
            },
            {
                "family_name": "Jawaheer",
                "given_name": "Damini",
                "clpid": "Jawaheer-D"
            }
        ],
        "abstract": "Women with Rheumatoid arthritis (RA) tend to have a predictable flare of disease activity in the months after childbirth. The mechanism(s) underlying this post-partum flare are as yet unknown. Using our pregnancy cohort, we (a) examined gene expression changes associated with a flare of RA disease activity post-partum, (b) determined how those changes compare to post-partum changes observed among healthy women, and (c) examined whether expression profiles by 3 months post-partum differed from those before pregnancy.",
        "doi": "10.1002/art.41108",
        "issn": "2326-5191",
        "publisher": "Wiley",
        "publication": "Arthritis and Rheumatology",
        "publication_date": "2019-10",
        "series_number": "S10",
        "volume": "71",
        "issue": "S10",
        "pages": "Art. No. 1940"
    },
    {
        "id": "authors:rcg7y-rez41",
        "collection": "authors",
        "collection_id": "rcg7y-rez41",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20200206-125453251",
        "type": "article",
        "title": "The Pre-pregnancy Rheumatoid Arthritis Gene Expression Signature Correlates with Improvement or Worsening of Disease Activity During Pregnancy: A Pilot Study",
        "author": [
            {
                "family_name": "Pathi",
                "given_name": "Amogh",
                "clpid": "Pathi-A"
            },
            {
                "family_name": "Smed",
                "given_name": "Mette",
                "clpid": "Smed-M-K"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "family_name": "Purdom",
                "given_name": "Elizabeth",
                "clpid": "Purdom-E"
            },
            {
                "family_name": "Wright",
                "given_name": "Matthew",
                "clpid": "Wright-M"
            },
            {
                "family_name": "Jewell",
                "given_name": "Nicholas",
                "clpid": "Jewell-N-P"
            },
            {
                "family_name": "Nelson",
                "given_name": "J. Lee",
                "clpid": "Nelson-J-L"
            },
            {
                "family_name": "Olsen",
                "given_name": "J\u00f8rn",
                "clpid": "Olsen-J"
            },
            {
                "family_name": "Hetland",
                "given_name": "Merete Lund",
                "clpid": "Hetland-M-L"
            },
            {
                "family_name": "Zoffmann",
                "given_name": "Vibeke",
                "clpid": "Zoffmann-V"
            },
            {
                "family_name": "Jawaheer",
                "given_name": "Damini",
                "clpid": "Jawaheer-D"
            }
        ],
        "abstract": "Pregnancy is known to induce a natural improvement of Rheumatoid Arthritis (RA) symptoms in 50-75% of patients as gestation progresses. However, the underlying mechanisms are not well understood and no biomarkers have been identified that predict whether a woman will improve or worsen during pregnancy. In this study, we aimed to identify RA-associated pre-pregnancy gene expression signatures to determine if they correlated with the subsequent improvement or worsening of RA during pregnancy.",
        "doi": "10.1002/art.41108",
        "issn": "2326-5191",
        "publisher": "Wiley",
        "publication": "Arthritis and Rheumatology",
        "publication_date": "2019-10",
        "series_number": "S10",
        "volume": "71",
        "issue": "S10",
        "pages": "Art. No. 1938"
    },
    {
        "id": "authors:b2hv9-6t971",
        "collection": "authors",
        "collection_id": "b2hv9-6t971",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20190610-075805993",
        "type": "article",
        "title": "Factor analysis for survival time prediction with informative censoring and diverse covariates",
        "author": [
            {
                "family_name": "McCurdy",
                "given_name": "Shannon",
                "orcid": "0000-0001-5555-4156",
                "clpid": "McCurdy-S-R"
            },
            {
                "family_name": "Molinaro",
                "given_name": "Annette",
                "orcid": "0000-0002-9854-7404",
                "clpid": "Molinaro-A-M"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "Fulfilling the promise of precision medicine requires accurately and precisely classifying disease states. For cancer, this includes prediction of survival time from a surfeit of covariates. Such data presents an opportunity for improved prediction, but also a challenge due to high dimensionality. Furthermore, disease populations can be heterogeneous. Integrative modeling is sensible, as the underlying hypothesis is that joint analysis of multiple covariates provides greater explanatory power than separate analyses. We propose an integrative latent variable model that combines factor analysis for various data types and an exponential proportional hazards (EPH) model for continuous survival time with informative censoring. The factor and EPH models are connected through low\u2010dimensional latent variables that can be interpreted and visualized to identify subpopulations. We use this model to predict survival time. We demonstrate this model's utility in simulation and on four Cancer Genome Atlas datasets: diffuse lower\u2010grade glioma, glioblastoma multiforme, lung adenocarcinoma, and lung squamous cell carcinoma. These datasets have small sample sizes, high\u2010dimensional diverse covariates, and high censorship rates. We compare the predictions from our model to three alternative models. Our model outperforms in simulation and is competitive on real datasets. Furthermore, the low\u2010dimensional visualization for diffuse lower\u2010grade glioma displays known subpopulations.",
        "doi": "10.1002/sim.8151",
        "issn": "0277-6715",
        "publisher": "Wiley",
        "publication": "Statistics in Medicine",
        "publication_date": "2019-09-10",
        "series_number": "20",
        "volume": "38",
        "issue": "20",
        "pages": "3719-3732"
    },
    {
        "id": "authors:f2956-nyy34",
        "collection": "authors",
        "collection_id": "f2956-nyy34",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20190827-103540654",
        "type": "article",
        "title": "Principles of open source bioinstrumentation applied to the poseidon syringe pump system",
        "author": [
            {
                "family_name": "Booeshaghi",
                "given_name": "A. Sina",
                "orcid": "0000-0002-6442-4502",
                "clpid": "Booeshaghi-A-S"
            },
            {
                "family_name": "da Veiga Beltrame",
                "given_name": "Eduardo",
                "orcid": "0000-0002-1529-9207",
                "clpid": "da-Veiga-Beltrame-E"
            },
            {
                "family_name": "Bannon",
                "given_name": "Dylan",
                "clpid": "Bannon-D"
            },
            {
                "family_name": "Gehring",
                "given_name": "Jase",
                "orcid": "0000-0002-3894-9495",
                "clpid": "Gehring-J"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "The poseidon syringe pump and microscope system is an open source alternative to commercial systems. It costs less than $400 and can be assembled in under an hour using the instructions and source files available at https://pachterlab.github.io/poseidon. We describe the poseidon system and use it to illustrate design principles that can facilitate the adoption and development of open source bioinstruments. The principles are functionality, robustness, safety, simplicity, modularity, benchmarking, and documentation.",
        "doi": "10.1038/s41598-019-48815-9",
        "pmcid": "PMC6711986",
        "issn": "2045-2322",
        "publisher": "Nature Publishing Group",
        "publication": "Scientific Reports",
        "publication_date": "2019-08-27",
        "volume": "9",
        "pages": "12385"
    },
    {
        "id": "authors:cpbyk-8yg76",
        "collection": "authors",
        "collection_id": "cpbyk-8yg76",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20190123-095919155",
        "type": "article",
        "title": "A discriminative learning approach to differential expression analysis for single-cell RNA-seq",
        "author": [
            {
                "family_name": "Ntranos",
                "given_name": "Vasilis",
                "orcid": "0000-0002-2477-0670",
                "clpid": "Ntranos-V"
            },
            {
                "family_name": "Yi",
                "given_name": "Lynn",
                "orcid": "0000-0003-4575-0158",
                "clpid": "Yi-Lynn"
            },
            {
                "family_name": "Melsted",
                "given_name": "P\u00e1ll",
                "orcid": "0000-0002-8418-6724",
                "clpid": "Melsted-P"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "Single-cell RNA-seq makes it possible to characterize the transcriptomes of cell types across different conditions and to identify their transcriptional signatures via differential analysis. Our method detects changes in transcript dynamics and in overall gene abundance in large numbers of cells to determine differential expression. When applied to transcript compatibility counts obtained via pseudoalignment, our approach provides a quantification-free analysis of 3\u2032 single-cell RNA-seq that can identify previously undetectable marker genes.",
        "doi": "10.1038/s41592-018-0303-9",
        "issn": "1548-7091",
        "publisher": "Nature Publishing Group",
        "publication": "Nature Methods",
        "publication_date": "2019-02",
        "series_number": "2",
        "volume": "16",
        "issue": "2",
        "pages": "163-166"
    },
    {
        "id": "authors:26j43-gs111",
        "collection": "authors",
        "collection_id": "26j43-gs111",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20181029-133340286",
        "type": "article",
        "title": "Deterministic column subset selection for single-cell RNA-Seq",
        "author": [
            {
                "family_name": "McCurdy",
                "given_name": "Shannon R.",
                "orcid": "0000-0001-5555-4156",
                "clpid": "McCurdy-S-R"
            },
            {
                "family_name": "Ntranos",
                "given_name": "Vasilis",
                "orcid": "0000-0002-2477-0670",
                "clpid": "Ntranos-V"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "Analysis of single-cell RNA sequencing (scRNA-Seq) data often involves filtering out uninteresting or poorly measured genes and dimensionality reduction to reduce noise and simplify data visualization. However, techniques such as principal components analysis (PCA) fail to preserve non-negativity and sparsity structures present in the original matrices, and the coordinates of projected cells are not easily interpretable. Commonly used thresholding methods to filter genes avoid those pitfalls, but ignore collinearity and covariance in the original matrix. We show that a deterministic column subset selection (DCSS) method possesses many of the favorable properties of common thresholding methods and PCA, while avoiding pitfalls from both. We derive new spectral bounds for DCSS. We apply DCSS to two measures of gene expression from two scRNA-Seq experiments with different clustering workflows, and compare to three thresholding methods. In each case study, the clusters based on the small subset of the complete gene expression profile selected by DCSS are similar to clusters produced from the full set. The resulting clusters are informative for cell type.",
        "doi": "10.1371/journal.pone.0210571",
        "pmcid": "PMC6347249",
        "issn": "1932-6203",
        "publisher": "Public Library of Science",
        "publication": "PLoS ONE",
        "publication_date": "2019-01-25",
        "series_number": "1",
        "volume": "14",
        "issue": "1",
        "pages": "Art. No. e0210571"
    },
    {
        "id": "authors:a5jaw-w3h87",
        "collection": "authors",
        "collection_id": "a5jaw-w3h87",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20181029-144423877",
        "type": "article",
        "title": "Barcode identification for single cell genomics",
        "author": [
            {
                "family_name": "Tambe",
                "given_name": "Akshay",
                "clpid": "Tambe-Akshay"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "Background: Single-cell sequencing experiments use short DNA barcode 'tags' to identify reads that originate from the same cell. In order to recover single-cell information from such experiments, reads must be grouped based on their barcode tag, a crucial processing step that precedes other computations. However, this step can be difficult due to high rates of mismatch and deletion errors that can afflict barcodes.\n\nResults: Here we present an approach to identify and error-correct barcodes by traversing the de Bruijn graph of circularized barcode k-mers. Our approach is based on the observation that circularizing a barcode sequence can yield error-free k-mers even when the size of k is large relative to the length of the barcode sequence, a regime which is typical of single-cell barcoding applications. This allows for assignment of reads to consensus fingerprints constructed from k-mers.\n\nConclusion: We show that for single-cell RNA-Seq, circularization improves the recovery of accurate single-cell transcriptome estimates, especially when there are a high number of errors per read. This approach is robust to the type of error (mismatch, insertion, deletion), as well as to the relative abundances of the cells. Sircel, a software package that implements this approach, is described and publicly available.",
        "doi": "10.1186/s12859-019-2612-0",
        "pmcid": "PMC6337828",
        "issn": "1471-2105",
        "publisher": "BioMed Central",
        "publication": "BMC Bioinformatics",
        "publication_date": "2019-01-17",
        "volume": "20",
        "pages": "Art. No. 32"
    },
    {
        "id": "authors:wt9yb-wny62",
        "collection": "authors",
        "collection_id": "wt9yb-wny62",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20181008-162020262",
        "type": "article",
        "title": "Expression reflects population structure",
        "author": [
            {
                "family_name": "Brown",
                "given_name": "Brielin C.",
                "orcid": "0000-0001-5569-5223",
                "clpid": "Brown-B-C"
            },
            {
                "family_name": "Bray",
                "given_name": "Nicolas L.",
                "clpid": "Bray-N-L"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "Population structure in genotype data has been extensively studied, and is revealed by looking at the principal components of the genotype matrix. However, no similar analysis of population structure in gene expression data has been conducted, in part because a na\u00efve principal components analysis of the gene expression matrix does not cluster by population. We identify a linear projection that reveals population structure in gene expression data. Our approach relies on the coupling of the principal components of genotype to the principal components of gene expression via canonical correlation analysis. Our method is able to determine the significance of the variance in the canonical correlation projection explained by each gene. We identify 3,571 significant genes, only 837 of which had been previously reported to have an associated eQTL in the GEUVADIS results. We show that our projections are not primarily driven by differences in allele frequency at known cis-eQTLs and that similar projections can be recovered using only several hundred randomly selected genes and SNPs. Finally, we present preliminary work on the consequences for eQTL analysis. We observe that using our projection co-ordinates as covariates results in the discovery of slightly fewer genes with eQTLs, but that these genes replicate in GTEx matched tissue at a slightly higher rate.",
        "doi": "10.1371/journal.pgen.1007841",
        "pmcid": "PMC6317812",
        "issn": "1553-7390",
        "publisher": "Public Library of Science",
        "publication": "PLoS Genetics",
        "publication_date": "2018-12-19",
        "series_number": "12",
        "volume": "14",
        "issue": "12",
        "pages": "Art. No. e1007841"
    },
    {
        "id": "authors:qhp6e-5ta13",
        "collection": "authors",
        "collection_id": "qhp6e-5ta13",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20181004-091624887",
        "type": "article",
        "title": "RNA Velocity: Molecular Kinetics from Single-Cell RNA-Seq",
        "author": [
            {
                "family_name": "Svensson",
                "given_name": "Valentine",
                "orcid": "0000-0002-9217-2330",
                "clpid": "Svensson-V"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "Applying a kinetic model of RNA transcription and splicing, La Manno et al. (2018) predict changes in mRNA levels of individual cells from single-cell RNA-seq data.",
        "doi": "10.1016/j.molcel.2018.09.026",
        "issn": "1097-2765",
        "publisher": "Elsevier",
        "publication": "Molecular Cell",
        "publication_date": "2018-10-04",
        "series_number": "1",
        "volume": "72",
        "issue": "1",
        "pages": "7-9"
    },
    {
        "id": "authors:qbnnk-nye38",
        "collection": "authors",
        "collection_id": "qbnnk-nye38",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20180423-152534642",
        "type": "article",
        "title": "Accurate design of translational output by a neural network model of ribosome distribution",
        "author": [
            {
                "family_name": "Tunney",
                "given_name": "Robert",
                "clpid": "Tunney-R-J"
            },
            {
                "family_name": "McGlincy",
                "given_name": "Nicholas J.",
                "orcid": "0000-0003-1412-2298",
                "clpid": "McGlincy-N-J"
            },
            {
                "family_name": "Graham",
                "given_name": "Monica E.",
                "clpid": "Graham-M-E"
            },
            {
                "family_name": "Naddaf",
                "given_name": "Nicki",
                "clpid": "Naddaf-N"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "family_name": "Lareau",
                "given_name": "Liana F.",
                "orcid": "0000-0003-3223-3426",
                "clpid": "Lareau-L-F"
            }
        ],
        "abstract": "Synonymous codon choice can have dramatic effects on ribosome speed and protein expression. Ribosome profiling experiments have underscored that ribosomes do not move uniformly along mRNAs. Here, we have modeled this variation in translation elongation by using a feed-forward neural network to predict the ribosome density at each codon as a function of its sequence neighborhood. Our approach revealed sequence features affecting translation elongation and characterized large technical biases in ribosome profiling. We applied our model to design synonymous variants of a fluorescent protein spanning the range of translation speeds predicted with our model. Levels of the fluorescent protein in budding yeast closely tracked the predicted translation speeds across their full range. We therefore demonstrate that our model captures information determining translation dynamics in vivo; that this information can be harnessed to design coding sequences; and that control of translation elongation alone is sufficient to produce large quantitative differences in protein output.",
        "doi": "10.1038/s41594-018-0080-2",
        "pmcid": "PMC6457438",
        "issn": "1545-9985",
        "publisher": "Nature Publishing Group",
        "publication": "Nature Structural & Molecular Biology",
        "publication_date": "2018-07",
        "series_number": "7",
        "volume": "25",
        "issue": "7",
        "pages": "577-582"
    },
    {
        "id": "authors:q0xz8-h1p26",
        "collection": "authors",
        "collection_id": "q0xz8-h1p26",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20190503-134759852",
        "type": "article",
        "title": "Association mapping from sequencing reads using k-mers",
        "author": [
            {
                "family_name": "Rahman",
                "given_name": "Atif",
                "orcid": "0000-0003-1805-3971",
                "clpid": "Rahman-Atif"
            },
            {
                "family_name": "Hallgr\u00edmsd\u00f3ttir",
                "given_name": "Ingileif",
                "clpid": "Hallgr\u00edmsd\u00f3ttir-Ingileif"
            },
            {
                "family_name": "Eisen",
                "given_name": "Michael",
                "orcid": "0000-0002-7528-738X",
                "clpid": "Eisen-Michael-B"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "Genome-wide association studies (GWAS) rely on microarrays, or more recently mapping of sequencing reads, to genotype individuals. The reliance on prior sequencing of a reference genome limits the scope of association studies, and also precludes mapping associations outside of the reference. We present an alignment-free method for association studies of categorical phenotypes based on counting k-mers in whole-genome sequencing reads, testing for associations directly between k-mers and the trait of interest, and local assembly of the statistically significant k-mers to identify sequence differences. An analysis of the 1000 Genomes data shows that sequences identified by our method largely agree with results obtained using the standard approach. However, unlike standard GWAS, our method identifies associations with structural variations and sites not present in the reference genome. We also demonstrate that population stratification can be inferred from k-mers. Finally, application to an E. coli dataset on ampicillin resistance validates the approach.",
        "doi": "10.7554/elife.32920",
        "pmcid": "PMC6044908",
        "issn": "2050-084X",
        "publisher": "eLife Sciences Publications",
        "publication": "eLife",
        "publication_date": "2018-06-13",
        "volume": "7",
        "pages": "Art. No. e32920"
    },
    {
        "id": "authors:st76a-kg126",
        "collection": "authors",
        "collection_id": "st76a-kg126",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20180416-090553011",
        "type": "article",
        "title": "Gene-level differential analysis at transcript-level resolution",
        "author": [
            {
                "family_name": "Yi",
                "given_name": "Lynn",
                "orcid": "0000-0003-4575-0158",
                "clpid": "Yi-Lynn"
            },
            {
                "family_name": "Pimentel",
                "given_name": "Harold",
                "clpid": "Pimentel-H"
            },
            {
                "family_name": "Bray",
                "given_name": "Nicolas L.",
                "clpid": "Bray-N-L"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "Compared to RNA-sequencing transcript differential analysis, gene-level differential expression analysis is more robust and experimentally actionable. However, the use of gene counts for statistical analysis can mask transcript-level dynamics. We demonstrate that 'analysis first, aggregation second,' where the p values derived from transcript analysis are aggregated to obtain gene-level results, increases sensitivity and accuracy. The method we propose can also be applied to transcript compatibility counts obtained from pseudoalignment of reads, which circumvents the need for quantification and is fast, accurate, and model-free. The method generalizes to various levels of biology and we showcase an application to gene ontologies.",
        "doi": "10.1186/s13059-018-1419-z",
        "pmcid": "PMC5896116",
        "issn": "1474-760X",
        "publisher": "BioMed Central",
        "publication": "Genome Biology",
        "publication_date": "2018-04-12",
        "volume": "19",
        "pages": "Art. No. 53"
    },
    {
        "id": "authors:nf54s-9xm95",
        "collection": "authors",
        "collection_id": "nf54s-9xm95",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20171113-143256428",
        "type": "article",
        "title": "Longitudinal Changes in Gene Expression Associated with Disease Activity during Pregnancy and Post-Partum Among Women with Rheumatoid Arthritis",
        "author": [
            {
                "family_name": "Goin",
                "given_name": "Dana E.",
                "clpid": "Goin-D-E"
            },
            {
                "family_name": "Smed",
                "given_name": "Mette",
                "clpid": "Smed-M-K"
            },
            {
                "family_name": "Jewell",
                "given_name": "Nicholas",
                "clpid": "Jewell-N-P"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "family_name": "Nelson",
                "given_name": "J. Lee",
                "clpid": "Nelson-J-L"
            },
            {
                "family_name": "Kjaergaard",
                "given_name": "Hanne",
                "clpid": "Kjaergaard-H"
            },
            {
                "family_name": "Olsen",
                "given_name": "J\u00f8rn",
                "clpid": "Olsen-J"
            },
            {
                "family_name": "Hetland",
                "given_name": "Merete Lund",
                "clpid": "Hetland-M-L"
            },
            {
                "family_name": "Ottesen",
                "given_name": "Bent",
                "clpid": "Ottesen-B"
            },
            {
                "family_name": "Zoffmann",
                "given_name": "Vibeke",
                "clpid": "Zoffmann-V"
            },
            {
                "family_name": "Jawaheer",
                "given_name": "Damini",
                "clpid": "Jawaheer-D"
            }
        ],
        "abstract": "Background/Purpose: Many women with rheumatoid arthritis (RA) experience an improvement in disease activity during pregnancy, and a predictable flare in the months after they give birth. The cause of these changes is unknown. We hypothesized that understanding biological changes (through gene expression) that occur from pre-pregnancy through the pregnancy and post-partum periods will contribute important evidence to our knowledge of the drivers of disease activity in RA during and after pregnancy.\nMethods: We have established a prospective RA pregnancy cohort, with clinical data and blood samples collected at pre-pregnancy (T0), each trimester of pregnancy and every 3 months up to a year post-partum (up to 8 time points). Disease activity at each time point was assessed using disease activity scores (DAS28CRP4); women who showed an improvement during pregnancy were selected for analysis (n=9). Global gene expression profiles for each sample were generated using RNA-sequencing (RNA-seq). Raw reads were pseudo-aligned and quantified using kallisto. Random effects regression models were used to estimate the effects of changes in gene expression on disease activity (a) from T0 through the pregnancy period (P1), and (b) in the post-partum period (P2). The models were adjusted for age, medication status at baseline and batch effects. Significance was assessed using a threshold of q<0.05 (FDR-adjusted). Functional enrichment analysis was performed using WebGestalt.\nResults: During pregnancy, 1,174 genes had expression patterns significantly associated with disease activity. While these were not significantly enriched in specific pathways, the genes whose increased expression was associated with the largest decrease (improvement) in disease activity during pregnancy were immune-related, and included ERAP1, CSNK2A1 and FAM175B. ERAP1 is involved in trimming peptides for presentation on MHC class I molecules; CSNK2A1 regulates cellular processes including cellular response to viral infection; FAM175B is involved in interferon-signaling. In the post-partum period, 4,693 genes had expression patterns significantly associated with disease activity. These were enriched (p<1x10^(-6)) in numerous immune-related pathways including MAPK signaling, T cell receptor signaling, osteoclast differentiation, hematopoietic cell lineage, B cell receptor signaling, Toll-like receptor signaling and leukocyte trans-endothelial migration, in addition to several pathways related to cancer. The genes whose increased expression was associated with larger increases in disease activity included EI24, CMTM7, PPP2CB and BFAR, which are related to tumor suppression and/or regulation of apoptosis.\nConclusion: In this pilot RA pregnancy cohort study with longitudinal RNA-seq data, several candidate genes were identified as significantly associated with improvement in disease activity during pregnancy, and others were associated with post-partum flares. These results warrant further investigations into possible roles of these genes in modulating RA disease activity in a larger cohort.",
        "doi": "10.1002/art.40321",
        "issn": "2326-5191",
        "publisher": "Wiley",
        "publication": "Arthritis and Rheumatology",
        "publication_date": "2017-10",
        "series_number": "S10",
        "volume": "69",
        "issue": "S10",
        "pages": "Art. No. 2432"
    },
    {
        "id": "authors:3hhxt-h4x56",
        "collection": "authors",
        "collection_id": "3hhxt-h4x56",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20171113-145148333",
        "type": "article",
        "title": "Transcriptome Analysis in Women with Rheumatoid Arthritis Who Improve or Worsen during Pregnancy",
        "author": [
            {
                "family_name": "Goin",
                "given_name": "Dana E.",
                "clpid": "Goin-D-E"
            },
            {
                "family_name": "Smed",
                "given_name": "Mette",
                "clpid": "Smed-M-K"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "family_name": "Purdom",
                "given_name": "Elizabeth",
                "clpid": "Purdom-E"
            },
            {
                "family_name": "Nelson",
                "given_name": "J. Lee",
                "clpid": "Nelson-J-L"
            },
            {
                "family_name": "Kjaergaard",
                "given_name": "Hanne",
                "clpid": "Kjaergaard-H"
            },
            {
                "family_name": "Olsen",
                "given_name": "J\u00f8rn",
                "clpid": "Olsen-J"
            },
            {
                "family_name": "Hetland",
                "given_name": "Merete Lund",
                "clpid": "Hetland-M-L"
            },
            {
                "family_name": "Ottesen",
                "given_name": "Bent",
                "clpid": "Ottesen-B"
            },
            {
                "family_name": "Zoffmann",
                "given_name": "Vibeke",
                "clpid": "Zoffmann-V"
            },
            {
                "family_name": "Jawaheer",
                "given_name": "Damini",
                "clpid": "Jawaheer-D"
            }
        ],
        "abstract": "Background/Purpose: Gene expression changes induced by pregnancy in women with rheumatoid arthritis (RA) and healthy women have not been examined. The few studies previously conducted did not have pre-pregnancy samples available as baseline. We have established a cohort of RA and healthy women followed prospectively from pre-pregnancy. In this study, we aimed to identify pregnancy-induced changes in gene expression among women with RA and healthy women, and to assess how those changes may differ between RA women who improve or worsen during pregnancy.\nMethods: Clinical data and samples collected from a subset of 11 women with RA and 5 healthy women from our cohort before pregnancy (T0) and at the third trimester (T3) were analyzed. Disease activity scores were used to determine whether the RA women improved or worsened during pregnancy. Global gene expression profiles were generated by RNA sequencing (RNA-seq). The raw RNA-seq reads were pseudo-aligned to the reference transcriptome and expression levels were estimated with kallisto. Differential expression analysis of normalized expression levels was performed using edgeR to identify genes differentially expressed within each group of women (T3 vs. T0), using a fold-change cut-off of 2 and a significance threshold of q<0.05 (FDR-adjusted). Functional enrichment analysis was performed using WebGestalt.\nResults: Of the 11 women with RA, 8 showed an improvement in disease activity by T3 (RA_(improved)), while 3 worsened (RA_(worsened)). In the RA_(improved) group, a total of 161 genes were differentially expressed (DE) between T3 and T0. These included several genes whose expression has previously been associated with RA (e.g. S100A12, SLC14A1) as well as genes involved in the innate immune system (e.g. type I interferon-inducible genes). The majority of these genes (108 of 161) were also DE among healthy women. Of interest, most genes (30 of 31) that were significantly DE in both of the RA groups were also DE among healthy women (e.g. \u03b1-defensin genes). There were also differences between the RA_(improved) and RA_(worsened) groups. A set of IFN-inducible genes was over-expressed at T3 (vs. T0) in the RA_(improved) but not the RA_(worsened) women. Additionally, some interesting candidate genes whose expression has previously been associated with RA (e.g. MMP9, PADI4 and PGLYRP1) were over-expressed at T3 (vs. T0) among RA_(worsened) but not among RA_(improved) women.\nConclusion: Pregnancy-induced gene expression changes common between RA women who improved and those who worsened appeared to be normal pregnancy-related changes that were also observed among healthy women. Other genes that demonstrated different patterns of expression between the two RA groups are potential candidates that could be involved in the natural pregnancy-induced amelioration of RA.",
        "doi": "10.1002/art.40321",
        "issn": "2326-5191",
        "publisher": "Wiley",
        "publication": "Arthritis and Rheumatology",
        "publication_date": "2017-10",
        "series_number": "S10",
        "volume": "69",
        "issue": "S10",
        "pages": "Art. No. 2433"
    },
    {
        "id": "authors:15tce-gp284",
        "collection": "authors",
        "collection_id": "15tce-gp284",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170306-131027010",
        "type": "article",
        "title": "Pseudoalignment for metagenomic read assignment",
        "author": [
            {
                "family_name": "Schaeffer",
                "given_name": "L.",
                "clpid": "Schaeffer-L-V"
            },
            {
                "family_name": "Pimentel",
                "given_name": "H.",
                "clpid": "Pimentel-H"
            },
            {
                "family_name": "Bray",
                "given_name": "N.",
                "clpid": "Bray-N-L"
            },
            {
                "family_name": "Melsted",
                "given_name": "P.",
                "orcid": "0000-0002-8418-6724",
                "clpid": "Melsted-P"
            },
            {
                "family_name": "Pachter",
                "given_name": "L.",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "Motivation: Read assignment is an important first step in many metagenomic analysis workflows, providing the basis for identification and quantification of species. However ambiguity among the sequences of many strains makes it difficult to assign reads at the lowest level of taxonomy, and reads are typically assigned to taxonomic levels where they are unambiguous. We explore connections between metagenomic read assignment and the quantification of transcripts from RNA-Seq data in order to develop novel methods for rapid and accurate quantification of metagenomic strains. \n\nResults: We find that the recent idea of pseudoalignment introduced in the RNA-Seq context is highly applicable in the metagenomics setting. When coupled with the Expectation-Maximization (EM) algorithm, reads can be assigned far more accurately and quickly than is currently possible with state of the art software, making it possible and practical for the first time to analyze abundances of individual genomes in metagenomics projects.",
        "doi": "10.1093/bioinformatics/btx106",
        "pmcid": "PMC5870846",
        "issn": "1367-4803",
        "publisher": "Oxford University Press",
        "publication": "Bioinformatics",
        "publication_date": "2017-07-15",
        "series_number": "14",
        "volume": "33",
        "issue": "14",
        "pages": "2082-2088"
    },
    {
        "id": "authors:kmsft-gtd43",
        "collection": "authors",
        "collection_id": "kmsft-gtd43",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170612-084553487",
        "type": "article",
        "title": "Differential analysis of RNA-seq incorporating quantification uncertainty",
        "author": [
            {
                "family_name": "Pimentel",
                "given_name": "Harold",
                "clpid": "Pimentel-H"
            },
            {
                "family_name": "Bray",
                "given_name": "Nicolas L.",
                "clpid": "Bray-N-L"
            },
            {
                "family_name": "Puente",
                "given_name": "Suzette",
                "clpid": "Puente-S"
            },
            {
                "family_name": "Melsted",
                "given_name": "P\u00e1ll",
                "orcid": "0000-0002-8418-6724",
                "clpid": "Melsted-P"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "We describe sleuth (http://pachterlab.github.io/sleuth), a method for the differential analysis of gene expression data that utilizes bootstrapping in conjunction with response error linear modeling to decouple biological variance from inferential variance. sleuth is implemented in an interactive shiny app that utilizes kallisto quantifications and bootstraps for fast and accurate analysis of data from RNA-seq experiments.",
        "doi": "10.1038/nmeth.4324",
        "issn": "1548-7091",
        "publisher": "Nature Publishing Group",
        "publication": "Nature Methods",
        "publication_date": "2017-07",
        "series_number": "7",
        "volume": "14",
        "issue": "7",
        "pages": "687-690"
    },
    {
        "id": "authors:jbjbb-4as73",
        "collection": "authors",
        "collection_id": "jbjbb-4as73",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170531-131446160",
        "type": "article",
        "title": "Pregnancy-induced gene expression changes in vivo among women with rheumatoid arthritis: a pilot study",
        "author": [
            {
                "family_name": "Goin",
                "given_name": "Dana E.",
                "clpid": "Goin-D-E"
            },
            {
                "family_name": "Smed",
                "given_name": "Mette Kiel",
                "clpid": "Smed-M-K"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "family_name": "Purdom",
                "given_name": "Elizabeth",
                "clpid": "Purdom-E"
            },
            {
                "family_name": "Nelson",
                "given_name": "J. Lee",
                "clpid": "Nelson-J-L"
            },
            {
                "family_name": "Kj\u00e6rgaard",
                "given_name": "Hanne",
                "clpid": "Kj\u00e6rgaard-H"
            },
            {
                "family_name": "Olsen",
                "given_name": "J\u00f8rn",
                "clpid": "Olsen-J"
            },
            {
                "family_name": "Hetland",
                "given_name": "Merete Lund",
                "clpid": "Hetland-M-L"
            },
            {
                "family_name": "Zoffmann",
                "given_name": "Vibeke",
                "clpid": "Zoffmann-V"
            },
            {
                "family_name": "Ottesen",
                "given_name": "Bent",
                "clpid": "Ottesen-B"
            },
            {
                "family_name": "Jawaheer",
                "given_name": "Damini",
                "clpid": "Jawaheer-D"
            }
        ],
        "abstract": "Background: Little is known about gene expression changes induced by pregnancy in women with rheumatoid arthritis (RA) and healthy women because the few studies previously conducted did not have pre-pregnancy samples available as baseline. We have established a cohort of women with RA and healthy women followed prospectively from a pre-pregnancy baseline. In this study, we tested the hypothesis that pregnancy-induced changes in gene expression among women with RA who improve during pregnancy (pregDAS_(improved)) overlap substantially with changes observed among healthy women and differ from changes observed among women with RA who worsen during pregnancy (pregDAS_(worse)). \n\nMethods: Global gene expression profiles were generated by RNA sequencing (RNA-seq) from 11 women with RA and 5 healthy women before pregnancy (T0) and at the third trimester (T3). Among the women with RA, eight showed an improvement in disease activity by T3, whereas three worsened. Differential expression analysis was used to identify genes demonstrating significant changes in expression within each of the RA and healthy groups (T3 vs T0), as well as between the groups at each time point. Gene set enrichment was assessed in terms of Gene Ontology processes and protein networks. \n\nResults: A total of 1296 genes were differentially expressed between T3 and T0 among the 8 pregDAS_(improved) women, with 161 genes showing at least two-fold change (FC) in expression by T3. The majority (108 of 161 genes) were also differentially expressed among healthy women (q&lt;0.05, FC\u22652). Additionally, a small cluster of genes demonstrated contrasting changes in expression between the pregDAS_(improved) and pregDAS_(worse) groups, all of which were inducible by type I interferon (IFN). These IFN-inducible genes were over-expressed at T3 compared to the T0 baseline among the pregDAS_(improved) women. 
\n\nConclusions: In our pilot RNA-seq dataset, increased pregnancy-induced expression of type I IFN-inducible genes was observed among women with RA who improved during pregnancy, but not among women who worsened. These findings warrant further investigation into expression of these genes in RA pregnancy and their potential role in modulation of disease activity. Because these results are preliminary, they should be interpreted with caution until replicated in a larger sample.",
        "doi": "10.1186/s13075-017-1312-2",
        "pmcid": "PMC5445464",
        "issn": "1478-6362",
        "publisher": "BioMed Central",
        "publication": "Arthritis Research and Therapy",
        "publication_date": "2017-05-25",
        "series_number": "1",
        "volume": "19",
        "issue": "1",
        "pages": "Art. No. 104"
    },
    {
        "id": "authors:7czc6-vfq51",
        "collection": "authors",
        "collection_id": "7czc6-vfq51",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170510-142406445",
        "type": "article",
        "title": "PROBer Provides a General Toolkit for Analyzing Sequencing-Based Toeprinting Assays",
        "author": [
            {
                "family_name": "Li",
                "given_name": "Bo",
                "orcid": "0000-0002-8019-8891",
                "clpid": "Li-Bo"
            },
            {
                "family_name": "Tambe",
                "given_name": "Akshay",
                "clpid": "Tambe-Akshay"
            },
            {
                "family_name": "Aviran",
                "given_name": "Sharon",
                "clpid": "Aviran-Sharon"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "A number of sequencing-based transcriptase drop-off assays have recently been developed to probe post-transcriptional dynamics of RNA-protein interaction, RNA structure, and RNA modification. Although these assays survey a diverse set of epitranscriptomic marks, we use the term toeprinting assays since they share methodological similarities. Their interpretation is predicated on addressing a similar computational challenge: how to learn isoform-specific chemical modification profiles in the face of complex read multi-mapping. We introduce PROBer, a statistical model and associated software, that addresses this challenge for the analysis of toeprinting assays. PROBer takes sequencing data as input and outputs estimated transcript abundances and isoform-specific modification profiles. Results on both simulated and biological data demonstrate that PROBer significantly outperforms individual methods tailored for specific toeprinting assays. Since the space of toeprinting assays is ever expanding and these assays are likely to be performed and analyzed together, we believe PROBer's unified data analysis solution will be valuable to the RNA community.",
        "doi": "10.1016/j.cels.2017.04.007",
        "pmcid": "PMC5758053",
        "issn": "2405-4712",
        "publisher": "Cell Press",
        "publication": "Cell Systems",
        "publication_date": "2017-05-24",
        "series_number": "5",
        "volume": "4",
        "issue": "5",
        "pages": "568-574"
    },
    {
        "id": "authors:tfn4w-6em19",
        "collection": "authors",
        "collection_id": "tfn4w-6em19",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170505-103858288",
        "type": "article",
        "title": "Zika infection of neural progenitor cells perturbs transcription in neurodevelopmental pathways",
        "author": [
            {
                "family_name": "Yi",
                "given_name": "Lynn",
                "orcid": "0000-0003-4575-0158",
                "clpid": "Yi-Lynn"
            },
            {
                "family_name": "Pimentel",
                "given_name": "Harold",
                "clpid": "Pimentel-H"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "Background: A recent study of the gene expression patterns of Zika virus (ZIKV) infected human neural progenitor cells (hNPCs) revealed transcriptional dysregulation and identified cell cycle-related pathways that are affected by infection. However deeper exploration of the information present in the RNA-Seq data can be used to further elucidate the manner in which Zika infection of hNPCs affects the transcriptome, refining pathway predictions and revealing isoform-specific dynamics. \n\nMethodology/Principal findings: We analyzed data published by Tang et al. using state-of-the-art tools for transcriptome analysis. By accounting for the experimental design and estimation of technical and inferential variance we were able to pinpoint Zika infection affected pathways that highlight Zika's neural tropism. The examination of differential genes reveals cases of isoform divergence. \n\nConclusions: Transcriptome analysis of Zika infected hNPCs has the potential to identify the molecular signatures of Zika infected neural cells. These signatures may be useful for diagnostics and for the resolution of infection pathways that can be used to harvest specific targets for further study.",
        "doi": "10.1371/journal.pone.0175744",
        "pmcid": "PMC5407828",
        "issn": "1932-6203",
        "publisher": "Public Library of Science",
        "publication": "PLoS ONE",
        "publication_date": "2017-04-27",
        "series_number": "4",
        "volume": "12",
        "issue": "4",
        "pages": "Art. No. e0175744"
    },
    {
        "id": "authors:fw493-kaf67",
        "collection": "authors",
        "collection_id": "fw493-kaf67",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20190506-141348569",
        "type": "article",
        "title": "The Lair: a resource for exploratory analysis of published RNA-Seq data",
        "author": [
            {
                "family_name": "Pimentel",
                "given_name": "Harold",
                "clpid": "Pimentel-H"
            },
            {
                "family_name": "Sturmfels",
                "given_name": "Pascal",
                "clpid": "Sturmfels-P"
            },
            {
                "family_name": "Bray",
                "given_name": "Nicolas",
                "clpid": "Bray-N-L"
            },
            {
                "family_name": "Melsted",
                "given_name": "P\u00e1ll",
                "orcid": "0000-0002-8418-6724",
                "clpid": "Melsted-P"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "Increased emphasis on reproducibility of published research in the last few years has led to the large-scale archiving of sequencing data. While this data can, in theory, be used to reproduce results in papers, it is difficult to use in practice. We introduce a series of tools for processing and analyzing RNA-Seq data in the Sequence Read Archive, that together have allowed us to build an easily extendable resource for analysis of data underlying published papers. Our system makes the exploration of data easily accessible and usable without technical expertise. Our database and associated tools can be accessed at The Lair: http://pachterlab.github.io/lair",
        "doi": "10.1186/s12859-016-1357-2",
        "pmcid": "PMC5131447",
        "issn": "1471-2105",
        "publisher": "BioMed Central",
        "publication": "BMC Bioinformatics",
        "publication_date": "2016-12-01",
        "series_number": "1",
        "volume": "17",
        "issue": "1",
        "pages": "Art. No. 490"
    },
    {
        "id": "authors:cz0cp-gge47",
        "collection": "authors",
        "collection_id": "cz0cp-gge47",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170306-150123048",
        "type": "article",
        "title": "Estimating intrinsic and extrinsic noise from single-cell gene expression measurements",
        "author": [
            {
                "family_name": "Fu",
                "given_name": "Audrey Qiuyan",
                "clpid": "Fu-Audrey-Qiuyan"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "Gene expression is stochastic and displays variation (\"noise\") both within and between cells. Intracellular (intrinsic) variance can be distinguished from extracellular (extrinsic) variance by applying the law of total variance to data from two-reporter assays that probe expression of identically regulated gene pairs in single cells. We examine established formulas [Elowitz, M. B., A. J. Levine, E. D. Siggia and P. S. Swain (2002): \"Stochastic gene expression in a single cell,\" Science, 297, 1183\u20131186.] for the estimation of intrinsic and extrinsic noise and provide interpretations of them in terms of a hierarchical model. This allows us to derive alternative estimators that minimize bias or mean squared error. We provide a geometric interpretation of these results that clarifies the interpretation in [Elowitz, M. B., A. J. Levine, E. D. Siggia and P. S. Swain (2002): \"Stochastic gene expression in a single cell,\" Science, 297, 1183\u20131186.]. We also demonstrate through simulation and re-analysis of published data that the distribution assumptions underlying the hierarchical model have to be satisfied for the estimators to produce sensible results, which highlights the importance of normalization.",
        "doi": "10.1515/sagmb-2016-0002",
        "pmcid": "PMC5518956",
        "issn": "2194-6302",
        "publisher": "De Gruyter",
        "publication": "Statistical Applications in Genetics and Molecular Biology",
        "publication_date": "2016-12",
        "series_number": "6",
        "volume": "15",
        "issue": "6",
        "pages": "447-471"
    },
    {
        "id": "authors:pv1dd-91s03",
        "collection": "authors",
        "collection_id": "pv1dd-91s03",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20190503-153138885",
        "type": "article",
        "title": "Single-cell analysis at the threshold",
        "author": [
            {
                "family_name": "Chen",
                "given_name": "Xi",
                "orcid": "0000-0003-2648-3146",
                "clpid": "Chen-Xi-BIO"
            },
            {
                "family_name": "Love",
                "given_name": "J. Christopher",
                "clpid": "Love-J-C"
            },
            {
                "family_name": "Navin",
                "given_name": "Nicholas E.",
                "clpid": "Navin-N-E"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "family_name": "Stubbington",
                "given_name": "Michael J. T.",
                "clpid": "Stubbington-M-J-T"
            },
            {
                "family_name": "Svensson",
                "given_name": "Valentine",
                "orcid": "0000-0002-9217-2330",
                "clpid": "Svensson-V"
            },
            {
                "family_name": "Sweedler",
                "given_name": "Jonathan V.",
                "clpid": "Sweedler-J-V"
            },
            {
                "family_name": "Teichmann",
                "given_name": "Sarah A.",
                "orcid": "0000-0002-6294-6366",
                "clpid": "Teichmann-S-A"
            }
        ],
        "abstract": "A discussion of some of the challenges and promise of single-cell technology.",
        "doi": "10.1038/nbt.3721",
        "issn": "1087-0156",
        "publisher": "Nature Publishing Group",
        "publication": "Nature Biotechnology",
        "publication_date": "2016-11",
        "series_number": "11",
        "volume": "34",
        "issue": "11",
        "pages": "1111-1118"
    },
    {
        "id": "authors:jj12q-40t75",
        "collection": "authors",
        "collection_id": "jj12q-40t75",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170309-154829565",
        "type": "article",
        "title": "Transcriptomic response of Drosophila melanogaster pupae developed in hypergravity",
        "author": [
            {
                "family_name": "Hateley",
                "given_name": "Shannon",
                "clpid": "Hateley-S"
            },
            {
                "family_name": "Hosamani",
                "given_name": "Ravikumar",
                "clpid": "Hosamani-R"
            },
            {
                "family_name": "Bhardwaj",
                "given_name": "Shilpa R.",
                "clpid": "Bhardwaj-S-R"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "family_name": "Bhattacharya",
                "given_name": "Sharmila",
                "clpid": "Bhattacharya-S"
            }
        ],
        "abstract": "Altered gravity can perturb normal development and induce corresponding changes in gene expression. Understanding this relationship between the physical environment and a biological response is important for NASA's space travel goals. We use RNA-Seq and qRT-PCR techniques to profile changes in early Drosophila melanogaster pupae exposed to chronic hypergravity (3 g, or three times Earth's gravity). During the pupal stage, D. melanogaster rely upon gravitational cues for proper development. Assessing gene expression changes in the pupae under altered gravity conditions helps highlight gravity-dependent genetic pathways. A robust transcriptional response was observed in hypergravity-treated pupae compared to controls, with 1513 genes showing a significant (q &lt; 0.05) difference in gene expression. Five major biological processes were affected: ion transport, redox homeostasis, immune response, proteolysis, and cuticle development. \n\nThis outlines the underlying molecular and biological changes occurring in Drosophila pupae in response to hypergravity; gravity is important for many biological processes on Earth.",
        "doi": "10.1016/j.ygeno.2016.09.002",
        "issn": "0888-7543",
        "publisher": "Elsevier",
        "publication": "Genomics",
        "publication_date": "2016-10",
        "series_number": "3-4",
        "volume": "108",
        "issue": "3-4",
        "pages": "158-167"
    },
    {
        "id": "authors:3dew4-m0c86",
        "collection": "authors",
        "collection_id": "3dew4-m0c86",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20190503-155957743",
        "type": "article",
        "title": "Fast and accurate single-cell RNA-seq analysis by clustering of transcript-compatibility counts",
        "author": [
            {
                "family_name": "Ntranos",
                "given_name": "Vasilis",
                "orcid": "0000-0002-2477-0670",
                "clpid": "Ntranos-V"
            },
            {
                "family_name": "Kamath",
                "given_name": "Govinda M.",
                "clpid": "Kamath-G-M"
            },
            {
                "family_name": "Zhang",
                "given_name": "Jesse M.",
                "clpid": "Zhang-Jesse-M"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "family_name": "Tse",
                "given_name": "David N.",
                "clpid": "Tse-David-N"
            }
        ],
        "abstract": "Current approaches to single-cell transcriptomic analysis are computationally intensive and require assay-specific modeling, which limits their scope and generality. We propose a novel method that compares and clusters cells based on their transcript-compatibility read counts rather than on the transcript or gene quantifications used in standard analysis pipelines. In the reanalysis of two landmark yet disparate single-cell RNA-seq datasets, we show that our method is up to two orders of magnitude faster than previous approaches, provides accurate and in some cases improved results, and is directly applicable to data from a wide variety of assays.",
        "doi": "10.1186/s13059-016-0970-8",
        "pmcid": "PMC4881296",
        "issn": "1474-760X",
        "publisher": "BioMed Central",
        "publication": "Genome Biology",
        "publication_date": "2016-05-26",
        "series_number": "1",
        "volume": "17",
        "issue": "1",
        "pages": "Art. No. 112"
    },
    {
        "id": "authors:9cqwz-arv02",
        "collection": "authors",
        "collection_id": "9cqwz-arv02",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20190506-110012992",
        "type": "article",
        "title": "Near-optimal probabilistic RNA-seq quantification",
        "author": [
            {
                "family_name": "Bray",
                "given_name": "Nicolas L.",
                "clpid": "Bray-N-L"
            },
            {
                "family_name": "Pimentel",
                "given_name": "Harold",
                "clpid": "Pimentel-H"
            },
            {
                "family_name": "Melsted",
                "given_name": "P\u00e1ll",
                "orcid": "0000-0002-8418-6724",
                "clpid": "Melsted-P"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "We present kallisto, an RNA-seq quantification program that is two orders of magnitude faster than previous approaches and achieves similar accuracy. Kallisto pseudoaligns reads to a reference, producing a list of transcripts that are compatible with each read while avoiding alignment of individual bases. We use kallisto to analyze 30 million unaligned paired-end RNA-seq reads in &lt;10 min on a standard laptop computer. This removes a major computational bottleneck in RNA-seq analysis.",
        "doi": "10.1038/nbt.3519",
        "issn": "1087-0156",
        "publisher": "Nature Publishing Group",
        "publication": "Nature Biotechnology",
        "publication_date": "2016-05",
        "series_number": "5",
        "volume": "34",
        "issue": "5",
        "pages": "525-527"
    },
    {
        "id": "authors:6h7fw-ebc29",
        "collection": "authors",
        "collection_id": "6h7fw-ebc29",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170303-131213123",
        "type": "article",
        "title": "A dynamic intron retention program enriched in RNA processing genes regulates gene expression during terminal erythropoiesis",
        "author": [
            {
                "family_name": "Pimentel",
                "given_name": "Harold",
                "clpid": "Pimentel-H"
            },
            {
                "family_name": "Parra",
                "given_name": "Marilyn",
                "clpid": "Parra-M"
            },
            {
                "family_name": "Gee",
                "given_name": "Sherry L.",
                "clpid": "Gee-S-L"
            },
            {
                "family_name": "Mohandas",
                "given_name": "Narla",
                "clpid": "Mohandas-N"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "family_name": "Conboy",
                "given_name": "John G.",
                "clpid": "Conboy-J-G"
            }
        ],
        "abstract": "Differentiating erythroblasts execute a dynamic alternative splicing program shown here to include extensive and diverse intron retention (IR) events. Cluster analysis revealed hundreds of developmentally-dynamic introns that exhibit increased IR in mature erythroblasts, and are enriched in functions related to RNA processing such as SF3B1 spliceosomal factor. Distinct, developmentally-stable IR clusters are enriched in metal-ion binding functions and include mitoferrin genes SLC25A37 and SLC25A28 that are critical for iron homeostasis. Some IR transcripts are abundant, e.g. comprising \u223c50% of highly-expressed SLC25A37 and SF3B1 transcripts in late erythroblasts, and thereby limiting functional mRNA levels. IR transcripts tested were predominantly nuclear-localized. Splice site strength correlated with IR among stable but not dynamic intron clusters, indicating distinct regulation of dynamically-increased IR in late erythroblasts. Retained introns were preferentially associated with alternative exons with premature termination codons (PTCs). High IR was observed in disease-causing genes including SF3B1 and the RNA binding protein FUS. Comparative studies demonstrated that the intron retention program in erythroblasts shares features with other tissues but ultimately is unique to erythropoiesis. We conclude that IR is a multi-dimensional set of processes that post-transcriptionally regulate diverse gene groups during normal erythropoiesis, misregulation of which could be responsible for human disease.",
        "doi": "10.1093/nar/gkv1168",
        "pmcid": "PMC4737145",
        "issn": "0305-1048",
        "publisher": "Oxford University Press",
        "publication": "Nucleic Acids Research",
        "publication_date": "2016-01-29",
        "series_number": "2",
        "volume": "44",
        "issue": "2",
        "pages": "838-851"
    },
    {
        "id": "authors:2n03m-r5660",
        "collection": "authors",
        "collection_id": "2n03m-r5660",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170303-124807271",
        "type": "article",
        "title": "Pregnancy-Induced Changes in Systemic Gene Expression among Healthy Women and Women with Rheumatoid Arthritis",
        "author": [
            {
                "family_name": "Mittal",
                "given_name": "Anuradha",
                "clpid": "Mittal-A"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "family_name": "Nelson",
                "given_name": "J. Lee",
                "clpid": "Nelson-J-L"
            },
            {
                "family_name": "Kj\u00e6rgaard",
                "given_name": "Hanne",
                "clpid": "Kj\u00e6rgaard-H"
            },
            {
                "family_name": "Smed",
                "given_name": "Mette Kiel",
                "clpid": "Smed-M-K"
            },
            {
                "family_name": "Gildengorin",
                "given_name": "Virginia L.",
                "clpid": "Gildengorin-V-L"
            },
            {
                "family_name": "Zoffmann",
                "given_name": "Vibeke",
                "clpid": "Zoffmann-V"
            },
            {
                "family_name": "Hetland",
                "given_name": "Merete Lund",
                "clpid": "Hetland-M-L"
            },
            {
                "family_name": "Jewell",
                "given_name": "Nicholas P.",
                "clpid": "Jewell-N-P"
            },
            {
                "family_name": "Olsen",
                "given_name": "J\u00f8rn",
                "clpid": "Olsen-J"
            },
            {
                "family_name": "Jawaheer",
                "given_name": "Damini",
                "clpid": "Jawaheer-D"
            }
        ],
        "abstract": "Background: Pregnancy induces drastic biological changes systemically, and has a beneficial effect on some autoimmune conditions such as rheumatoid arthritis (RA). However, specific systemic changes that occur as a result of pregnancy have not been thoroughly examined in healthy women or women with RA. The goal of this study was to identify genes with expression patterns associated with pregnancy, compared to pre-pregnancy as baseline and determine whether those associations are modified by presence of RA. \n\nResults: In our RNA sequencing (RNA-seq) dataset from 5 healthy women and 20 women with RA, normalized expression levels of 4,710 genes were significantly associated with pregnancy status (pre-pregnancy, first, second and third trimesters) over time, irrespective of presence of RA (False Discovery Rate (FDR)-adjusted p value < 0.05). These genes were enriched in pathways spanning multiple systems, as would be expected during pregnancy. A subset of these genes (n = 256) showed greater than two-fold change in expression during pregnancy compared to baseline levels, with distinct temporal trends through pregnancy. Another 98 genes involved in various biological processes including immune regulation exhibited expression patterns that were differentially associated with pregnancy in the presence or absence of RA. \n\nConclusions: Our findings support the hypothesis that the maternal immune system plays an active role during pregnancy, and also provide insight into other systemic changes that occur in the maternal transcriptome during pregnancy compared to the pre-pregnancy state. Only a small proportion of genes modulated by pregnancy were influenced by presence of RA in our data.",
        "doi": "10.1371/journal.pone.0145204",
        "pmcid": "PMC4684291",
        "issn": "1932-6203",
        "publisher": "Public Library of Science",
        "publication": "PLOS ONE",
        "publication_date": "2015-12-18",
        "series_number": "12",
        "volume": "10",
        "issue": "12",
        "pages": "Art. No. e0145204"
    },
    {
        "id": "authors:nrkwk-ccq57",
        "collection": "authors",
        "collection_id": "nrkwk-ccq57",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170303-130217211",
        "type": "article",
        "title": "Single-cell transcriptomics reveals receptor transformations during olfactory neurogenesis",
        "author": [
            {
                "family_name": "Hanchate",
                "given_name": "Naresh K.",
                "clpid": "Hanchate-N-K"
            },
            {
                "family_name": "Kondoh",
                "given_name": "Kunio",
                "clpid": "Kondoh-Kunio"
            },
            {
                "family_name": "Lu",
                "given_name": "Zhonghua",
                "clpid": "Lu-Zhonglua"
            },
            {
                "family_name": "Kuang",
                "given_name": "Donghui",
                "clpid": "Kuand-Donghui"
            },
            {
                "family_name": "Ye",
                "given_name": "Xiaolan",
                "clpid": "Ye-Xiaolan"
            },
            {
                "family_name": "Qiu",
                "given_name": "Xiaojie",
                "clpid": "Qiu-Xiaojie"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "family_name": "Trapnell",
                "given_name": "Cole",
                "clpid": "Trapnell-C"
            },
            {
                "family_name": "Buck",
                "given_name": "Linda B.",
                "clpid": "Buck-L-B"
            }
        ],
        "abstract": "The sense of smell allows chemicals to be perceived as diverse scents. We used single neuron RNA-Sequencing (RNA-Seq) to explore developmental mechanisms that shape this ability as nasal olfactory neurons mature in mice. Most mature neurons expressed only one of the roughly 1000 odorant receptor genes (Olfrs) available, and that at high levels. However, many immature neurons expressed low levels of multiple Olfrs. Coexpressed Olfrs localized to overlapping zones of the nasal epithelium, suggesting regional biases, but not to single genomic loci. A single immature neuron could express Olfrs from up to seven different chromosomes. The mature state in which expression of Olfr genes is restricted to one per neuron emerges over a developmental progression that appears independent of neuronal activity requiring sensory transduction molecules.",
        "doi": "10.1126/science.aad2456",
        "pmcid": "PMC5642900",
        "issn": "0036-8075",
        "publisher": "American Association for the Advancement of Science",
        "publication": "Science",
        "publication_date": "2015-12-04",
        "series_number": "6265",
        "volume": "350",
        "issue": "6265",
        "pages": "1251-1255"
    },
    {
        "id": "authors:bj6ys-b0g23",
        "collection": "authors",
        "collection_id": "bj6ys-b0g23",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20150721-085701041",
        "type": "article",
        "title": "The NIH BD2K center for big data in translational genomics",
        "author": [
            {
                "family_name": "Paten",
                "given_name": "Benedict",
                "clpid": "Paten-B"
            },
            {
                "family_name": "Diekhans",
                "given_name": "Mark",
                "clpid": "Diekhans-M"
            },
            {
                "family_name": "Druker",
                "given_name": "Brian J.",
                "clpid": "Druker-B-J"
            },
            {
                "family_name": "Friend",
                "given_name": "Stephen",
                "clpid": "Friend-S"
            },
            {
                "family_name": "Guinney",
                "given_name": "Justin",
                "clpid": "Guinney-J"
            },
            {
                "family_name": "Gassner",
                "given_name": "Nadine",
                "clpid": "Gassner-N"
            },
            {
                "family_name": "Guttman",
                "given_name": "Mitchell",
                "orcid": "0000-0003-4748-9352",
                "clpid": "Guttman-M"
            },
            {
                "family_name": "Kent",
                "given_name": "W. James",
                "clpid": "Kent-W-J"
            },
            {
                "family_name": "Mantey",
                "given_name": "Patrick",
                "clpid": "Mantey-P"
            },
            {
                "family_name": "Margolin",
                "given_name": "Adam A.",
                "clpid": "Margolin-A-A"
            },
            {
                "family_name": "Massie",
                "given_name": "Matt",
                "clpid": "Massie-M"
            },
            {
                "family_name": "Novak",
                "given_name": "Adam M.",
                "clpid": "Novak-A-M"
            },
            {
                "family_name": "Nothaft",
                "given_name": "Frank",
                "clpid": "Nothaft-F"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "family_name": "Patterson",
                "given_name": "David",
                "clpid": "Patterson-D"
            },
            {
                "family_name": "Smuga-Otto",
                "given_name": "Maciej",
                "clpid": "Smuga-Otto-M"
            },
            {
                "family_name": "Stuart",
                "given_name": "Joshua M.",
                "clpid": "Stuart-J-M"
            },
            {
                "family_name": "Van't Veer",
                "given_name": "Laura",
                "clpid": "Van't-Veer-L"
            },
            {
                "family_name": "Wold",
                "given_name": "Barbara",
                "orcid": "0000-0003-3235-8130",
                "clpid": "Wold-B-J"
            },
            {
                "family_name": "Haussler",
                "given_name": "David",
                "clpid": "Haussler-D"
            }
        ],
        "abstract": "The world's genomics data will never be stored in a single repository \u2013 rather, it will be distributed among many sites in many countries. No one site will have enough data to explain genotype to phenotype relationships in rare diseases; therefore, sites must share data. To accomplish this, the genetics community must forge common standards and protocols to make sharing and computing data among many sites a seamless activity. Through the Global Alliance for Genomics and Health, we are pioneering the development of shared application programming interfaces (APIs) to connect the world's genome repositories. In parallel, we are developing an open source software stack (ADAM) that uses these APIs. This combination will create a cohesive genome informatics ecosystem. Using containers, we are facilitating the deployment of this software in a diverse array of environments. Through benchmarking efforts and big data driver projects, we are ensuring ADAM's performance and utility.",
        "doi": "10.1093/jamia/ocv047",
        "pmcid": "PMC5009913",
        "issn": "1067-5027",
        "publisher": "American Medical Informatics Association",
        "publication": "Journal of the American Medical Informatics Association",
        "publication_date": "2015-11",
        "series_number": "6",
        "volume": "22",
        "issue": "6",
        "pages": "1143-1147"
    },
    {
        "id": "authors:vg27x-zhk63",
        "collection": "authors",
        "collection_id": "vg27x-zhk63",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170303-132106100",
        "type": "article",
        "title": "Comprehensive, Integrative Genomic Analysis of Diffuse Lower-Grade Gliomas",
        "author": [
            {
                "family_name": "Brat",
                "given_name": "Daniel J.",
                "clpid": "Brat-D-J"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "literal": "Cancer Genome Atlas Research Network"
            }
        ],
        "abstract": "BACKGROUND: Diffuse low-grade and intermediate-grade gliomas (which together make up the lower-grade gliomas, World Health Organization grades II and III) have highly variable clinical behavior that is not adequately predicted on the basis of histologic class. Some are indolent; others quickly progress to glioblastoma. The uncertainty is compounded by interobserver variability in histologic diagnosis. Mutations in IDH, TP53, and ATRX and codeletion of chromosome arms 1p and 19q (1p/19q codeletion) have been implicated as clinically relevant markers of lower-grade gliomas. \n\nMETHODS: We performed genomewide analyses of 293 lower-grade gliomas from adults, incorporating exome sequence, DNA copy number, DNA methylation, messenger RNA expression, microRNA expression, and targeted protein expression. These data were integrated and tested for correlation with clinical outcomes. \n\nRESULTS: Unsupervised clustering of mutations and data from RNA, DNA-copy-number, and DNA-methylation platforms uncovered concordant classification of three robust, nonoverlapping, prognostically significant subtypes of lower-grade glioma that were captured more accurately by IDH, 1p/19q, and TP53 status than by histologic class. Patients who had lower-grade gliomas with an IDH mutation and 1p/19q codeletion had the most favorable clinical outcomes. Their gliomas harbored mutations in CIC, FUBP1, NOTCH1, and the TERT promoter. Nearly all lower-grade gliomas with IDH mutations and no 1p/19q codeletion had mutations in TP53 (94%) and ATRX inactivation (86%). The large majority of lower-grade gliomas without an IDH mutation had genomic aberrations and clinical behavior strikingly similar to those found in primary glioblastoma. \n\nCONCLUSIONS: The integration of genomewide data from multiple platforms delineated three molecular classes of lower-grade gliomas that were more concordant with IDH, 1p/19q, and TP53 status than with histologic class. Lower-grade gliomas with an IDH mutation either had 1p/19q codeletion or carried a TP53 mutation. Most lower-grade gliomas without an IDH mutation were molecularly and clinically similar to glioblastoma. (Funded by the National Institutes of Health.)",
        "doi": "10.1056/NEJMoa1402121",
        "pmcid": "PMC4530011",
        "issn": "0028-4793",
        "publisher": "Massachusetts Medical Society",
        "publication": "New England Journal of Medicine",
        "publication_date": "2015-06-25",
        "series_number": "26",
        "volume": "372",
        "issue": "26",
        "pages": "2481-2498"
    },
    {
        "id": "authors:3y9vj-fab67",
        "collection": "authors",
        "collection_id": "3y9vj-fab67",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170303-133219740",
        "type": "article",
        "title": "Controlling for conservation in genome-wide DNA methylation studies",
        "author": [
            {
                "family_name": "Singer",
                "given_name": "Meromit",
                "clpid": "Singer-M"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "BACKGROUND: A commonplace analysis in high-throughput DNA methylation studies is the comparison of methylation extent between different functional regions, computed by averaging methylation states within region types and then comparing averages between regions. For example, it has been reported that methylation is more prevalent in coding regions as compared to their neighboring introns or UTRs, leading to hypotheses about novel forms of epigenetic regulation. \n\nRESULTS: We have identified and characterized a bias present in these seemingly straightforward comparisons that results in the false detection of differences in methylation intensities across region types. This bias arises due to differences in conservation rates, rather than methylation rates, and is broadly present in the published literature. When controlling for conservation at coding start sites the differences in DNA methylation rates disappear. Moreover, a re-evaluation of methylation rates at intron-exon junctions reveals that the magnitude of previously reported differences is greatly exaggerated. We introduce two correction methods to address this bias, an inference-based matrix completion algorithm and an averaging approach, tailored to address different underlying biological questions. We evaluate how analysis using these corrections affects the detection of differences in DNA methylation across functional boundaries. \n\nCONCLUSIONS: We report here on a bias in DNA methylation comparative studies that originates in conservation rate differences and manifests itself in the false discovery of differences in DNA methylation intensities and their extents. We have characterized this bias and its broad implications, and show how to control for it so as to enable the study of a variety of biological questions.",
        "doi": "10.1186/s12864-015-1604-3",
        "pmcid": "PMC4448855",
        "issn": "1471-2164",
        "publisher": "BioMed Central",
        "publication": "BMC Genomics",
        "publication_date": "2015-05-30",
        "volume": "16",
        "pages": "Art. No. 420"
    },
    {
        "id": "authors:qbbbd-k1850",
        "collection": "authors",
        "collection_id": "qbbbd-k1850",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170303-133948010",
        "type": "article",
        "title": "A diverse epigenetic landscape at human exons with implication for expression",
        "author": [
            {
                "family_name": "Singer",
                "given_name": "Meromit",
                "clpid": "Singer-M"
            },
            {
                "family_name": "Kosti",
                "given_name": "Idit",
                "clpid": "Kosti-I"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "family_name": "Mandel-Gutfreund",
                "given_name": "Yael",
                "clpid": "Mandel-Gutfreund-Y"
            }
        ],
        "abstract": "DNA methylation is an important epigenetic marker associated with gene expression regulation in eukaryotes. While promoter methylation is relatively well characterized, the role of intragenic DNA methylation remains unclear. Here, we investigated the relationship of DNA methylation at exons and flanking introns with gene expression and histone modifications generated from a human fibroblast cell-line and primary B cells. Consistent with previous work we found that intragenic methylation is positively correlated with gene expression and that exons are more highly methylated than their neighboring intronic environment. Intriguingly, in this study we identified a unique subset of hypomethylated exons that demonstrate significantly lower methylation levels than their surrounding introns. Furthermore, we observed a negative correlation between exon methylation and the density of the majority of histone modifications. Specifically, we demonstrate that hypo-methylated exons at highly expressed genes are associated with open chromatin and have a characteristic histone code comprised of significantly high levels of histone markings. Overall, our comprehensive analysis of the human exome supports the presence of regulatory hypomethylated exons in protein coding genes. In particular our results reveal a previously unrecognized diverse and complex role of the epigenetic landscape within the gene body.",
        "doi": "10.1093/nar/gkv153",
        "pmcid": "PMC4402514",
        "issn": "0305-1048",
        "publisher": "Oxford University Press",
        "publication": "Nucleic Acids Research",
        "publication_date": "2015-04-20",
        "series_number": "7",
        "volume": "43",
        "issue": "7",
        "pages": "3498-3508"
    },
    {
        "id": "authors:acq1h-k9e77",
        "collection": "authors",
        "collection_id": "acq1h-k9e77",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170303-134607827",
        "type": "article",
        "title": "Rational experiment design for sequencing-based RNA structure mapping",
        "author": [
            {
                "family_name": "Aviran",
                "given_name": "Sharon",
                "clpid": "Aviran-S"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "Structure mapping is a classic experimental approach for determining nucleic acid structure that has gained renewed interest in recent years following advances in chemistry, genomics, and informatics. The approach encompasses numerous techniques that use different means to introduce nucleotide-level modifications in a structure-dependent manner. Modifications are assayed via cDNA fragment analysis, using electrophoresis or next-generation sequencing (NGS). The recent advent of NGS has dramatically increased the throughput, multiplexing capacity, and scope of RNA structure mapping assays, thereby opening new possibilities for genome-scale, de novo, and in vivo studies. From an informatics standpoint, NGS is more informative than prior technologies by virtue of delivering direct molecular measurements in the form of digital sequence counts. Motivated by these new capabilities, we introduce a novel model-based in silico approach for quantitative design of large-scale multiplexed NGS structure mapping assays, which takes advantage of the direct and digital nature of NGS readouts. We use it to characterize the relationship between controllable experimental parameters and the precision of mapping measurements. Our results highlight the complexity of these dependencies and shed light on relevant tradeoffs and pitfalls, which can be difficult to discern by intuition alone. We demonstrate our approach by quantitatively assessing the robustness of SHAPE-Seq measurements, obtained by multiplexing SHAPE (selective 2\u2032-hydroxyl acylation analyzed by primer extension) chemistry in conjunction with NGS. We then utilize it to elucidate design considerations in advanced genome-wide approaches for probing the transcriptome, which recently obtained in vivo information using dimethyl sulfate (DMS) chemistry.",
        "doi": "10.1261/rna.043844.113",
        "pmcid": "PMC4238353",
        "issn": "1355-8382",
        "publisher": "RNA Society",
        "publication": "RNA",
        "publication_date": "2014-12",
        "series_number": "12",
        "volume": "20",
        "issue": "12",
        "pages": "1864-1877"
    },
    {
        "id": "authors:57pcw-w6930",
        "collection": "authors",
        "collection_id": "57pcw-w6930",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170303-141031221",
        "type": "article",
        "title": "Human Intestinal Tissue with Adult Stem Cell Properties Derived from Pluripotent Stem Cells",
        "author": [
            {
                "family_name": "Forster",
                "given_name": "Ryan",
                "clpid": "Forster-R"
            },
            {
                "family_name": "Chiba",
                "given_name": "Kunitoshi",
                "clpid": "Chiba-Kunitoshi"
            },
            {
                "family_name": "Schaeffer",
                "given_name": "Lorian",
                "clpid": "Schaeffer-L-V"
            },
            {
                "family_name": "Regalado",
                "given_name": "Samuel G.",
                "clpid": "Regalado-S-G"
            },
            {
                "family_name": "Lai",
                "given_name": "Christine S.",
                "clpid": "Lai-Christine-S"
            },
            {
                "family_name": "Gao",
                "given_name": "Qing",
                "clpid": "Gao-Qing"
            },
            {
                "family_name": "Kiani",
                "given_name": "Samira",
                "clpid": "Kiana-S"
            },
            {
                "family_name": "Farin",
                "given_name": "Henner F.",
                "clpid": "Farin-H-F"
            },
            {
                "family_name": "Clevers",
                "given_name": "Hans",
                "clpid": "Clevers-H"
            },
            {
                "family_name": "Cost",
                "given_name": "Gregory J.",
                "clpid": "Cost-G-J"
            },
            {
                "family_name": "Chan",
                "given_name": "Andy",
                "clpid": "Chan-Andy"
            },
            {
                "family_name": "Rebar",
                "given_name": "Edward J.",
                "clpid": "Rebar-E-J"
            },
            {
                "family_name": "Urnov",
                "given_name": "Fyodor D.",
                "clpid": "Urnov-F-D"
            },
            {
                "family_name": "Gregory",
                "given_name": "Philip D.",
                "clpid": "Gregory-P-D"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "family_name": "Jaenisch",
                "given_name": "Rudolf",
                "clpid": "Jaenisch-R"
            },
            {
                "family_name": "Hockemeyer",
                "given_name": "Dirk",
                "clpid": "Hockemeyer-D"
            }
        ],
        "abstract": "Genetically engineered human pluripotent stem cells (hPSCs) have been proposed as a source for transplantation therapies and are rapidly becoming valuable tools for human disease modeling. However, many applications are limited due to the lack of robust differentiation paradigms that allow for the isolation of defined functional tissues. Here, using an endogenous LGR5-GFP reporter, we derived adult stem cells from hPSCs that gave rise to functional human intestinal tissue comprising all major cell types of the intestine. Histological and functional analyses revealed that such human organoid cultures could be derived with high purity and with a composition and morphology similar to those of cultures obtained from human biopsies. Importantly, hPSC-derived organoids responded to the canonical signaling pathways that control self-renewal and differentiation in the adult human intestinal stem cell compartment. This adult stem cell system provides a platform for studying human intestinal disease in vitro using genetically engineered hPSCs.",
        "doi": "10.1016/j.stemcr.2014.05.001",
        "pmcid": "PMC4050346",
        "issn": "2213-6711",
        "publisher": "Elsevier",
        "publication": "Stem Cell Reports",
        "publication_date": "2014-06-03",
        "series_number": "6",
        "volume": "2",
        "issue": "6",
        "pages": "838-852"
    },
    {
        "id": "authors:ffw23-m8n84",
        "collection": "authors",
        "collection_id": "ffw23-m8n84",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20190507-112223935",
        "type": "article",
        "title": "Structural Variation among Wild and Industrial Strains of Penicillium chrysogenum",
        "author": [
            {
                "family_name": "Wong",
                "given_name": "Valerie L.",
                "clpid": "Wong-Valerie-L"
            },
            {
                "family_name": "Ellison",
                "given_name": "Christopher E.",
                "clpid": "Ellison-C-E"
            },
            {
                "family_name": "Eisen",
                "given_name": "Michael B.",
                "orcid": "0000-0002-7528-738X",
                "clpid": "Eisen-M-B"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "family_name": "Brem",
                "given_name": "Rachel B.",
                "clpid": "Brem-R-B"
            }
        ],
        "abstract": "Strain selection and strain improvement are the first, and arguably most important, steps in the industrial production of biological compounds by microorganisms. While traditional methods of mutagenesis and selection have been effective in improving production of compounds at a commercial scale, the genetic changes underpinning the altered phenotypes have remained largely unclear. We utilized high-throughput Illumina short read sequencing of a wild Penicillium chrysogenum strain in order to make whole genome comparisons to a sequenced improved strain (WIS 54\u20131255). We developed an assembly-free method of identifying chromosomal rearrangements and validated the in silico predictions with a PCR-based assay and Sanger sequencing. Despite many rounds of mutagen treatment and artificial selection, WIS 54\u20131255 differs from its wild progenitor at only one of the identified rearrangements. We suggest that natural variants predisposed for high penicillin production were instrumental in the success of WIS 54\u20131255 as an industrial strain. In addition to finding a previously published inversion in the penicillin biosynthesis cluster, we located several genes related to penicillin production associated with these rearrangements. By comparing the configuration of rearrangement events among several historically important strains known to be high penicillin producers to a collection of recently isolated wild strains, we suggest that wild strains with rearrangements similar to those in known high penicillin producers may be viable candidates for further improvement efforts.",
        "doi": "10.1371/journal.pone.0096784",
        "pmcid": "PMC4019546",
        "issn": "1932-6203",
        "publisher": "Public Library of Science",
        "publication": "PLOS ONE",
        "publication_date": "2014-05-13",
        "series_number": "5",
        "volume": "9",
        "issue": "5",
        "pages": "Art. No. e96784"
    },
    {
        "id": "authors:xw760-6ct89",
        "collection": "authors",
        "collection_id": "xw760-6ct89",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170303-142644662",
        "type": "article",
        "title": "Genome methylation in D. melanogaster is found at specific short motifs and is independent of DNMT2 activity",
        "author": [
            {
                "family_name": "Takayama",
                "given_name": "Sachiko",
                "clpid": "Takayama-Sachiko"
            },
            {
                "family_name": "Dhahbi",
                "given_name": "Joseph",
                "clpid": "Dhahbi-Joseph"
            },
            {
                "family_name": "Roberts",
                "given_name": "Adam",
                "clpid": "Roberts-A"
            },
            {
                "family_name": "Mao",
                "given_name": "Guanxiong",
                "clpid": "Mao-Guanxiong"
            },
            {
                "family_name": "Heo",
                "given_name": "Seok-Jin",
                "clpid": "Heo-Seok-Jin"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "family_name": "Martin",
                "given_name": "David I. K.",
                "clpid": "Martin-David-I-K"
            },
            {
                "family_name": "Boffelli",
                "given_name": "Dario",
                "clpid": "Boffelli-D"
            }
        ],
        "abstract": "Cytosine methylation in the genome of Drosophila melanogaster has been elusive and controversial: Its location and function have not been established. We have used a novel and highly sensitive genomewide cytosine methylation assay to detect and map genome methylation in stage 5 Drosophila embryos. The methylation we observe with this method is highly localized and strand asymmetrical, limited to regions covering \u223c1% of the genome, dynamic in early embryogenesis, and concentrated in specific 5-base sequence motifs that are CA- and CT-rich but depleted of guanine. Gene body methylation is associated with lower expression, and many genes containing methylated regions have developmental or transcriptional functions. The only known DNA methyltransferase in Drosophila is the DNMT2 homolog MT2, but lines deficient for MT2 retain genomic methylation, implying the presence of a novel methyltransferase. The association of methylation with a lower expression of specific developmental genes at stage 5 raises the possibility that it participates in controlling gene expression during the maternal-zygotic transition.",
        "doi": "10.1101/gr.162412.113",
        "pmcid": "PMC4009611",
        "issn": "1088-9051",
        "publisher": "Cold Spring Harbor Laboratory Press",
        "publication": "Genome Research",
        "publication_date": "2014-05",
        "series_number": "5",
        "volume": "24",
        "issue": "5",
        "pages": "821-830"
    },
    {
        "id": "authors:jjxxg-g1y43",
        "collection": "authors",
        "collection_id": "jjxxg-g1y43",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170303-143819723",
        "type": "article",
        "title": "A dynamic alternative splicing program regulates gene expression during terminal erythropoiesis",
        "author": [
            {
                "family_name": "Pimentel",
                "given_name": "Harold",
                "clpid": "Pimentel-H"
            },
            {
                "family_name": "Parra",
                "given_name": "Marilynn",
                "clpid": "Parra-M"
            },
            {
                "family_name": "Gee",
                "given_name": "Sherry",
                "clpid": "Gee-S"
            },
            {
                "family_name": "Ghanem",
                "given_name": "Dana",
                "clpid": "Ghanem-D"
            },
            {
                "family_name": "An",
                "given_name": "Xiuli",
                "clpid": "An-Xiuli"
            },
            {
                "family_name": "Li",
                "given_name": "Jie",
                "orcid": "0000-0002-3733-4587",
                "clpid": "Li-Jie"
            },
            {
                "family_name": "Mohandas",
                "given_name": "Narla",
                "clpid": "Mohandas-N"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "family_name": "Conboy",
                "given_name": "John G.",
                "clpid": "Conboy-J-G"
            }
        ],
        "abstract": "Alternative pre-messenger RNA splicing remodels the human transcriptome in a spatiotemporal manner during normal development and differentiation. Here we explored the landscape of transcript diversity in the erythroid lineage by RNA-seq analysis of five highly purified populations of morphologically distinct human erythroblasts, representing the last four cell divisions before enucleation. In this unique differentiation system, we found evidence of an extensive and dynamic alternative splicing program encompassing genes with many diverse functions. Alternative splicing was particularly enriched in genes controlling cell cycle, organelle organization, chromatin function and RNA processing. Many alternative exons exhibited differentiation-associated switches in splicing efficiency, mostly in late-stage polychromatophilic and orthochromatophilic erythroblasts, in concert with extensive cellular remodeling that precedes enucleation. A subset of alternative splicing switches introduces premature translation termination codons into selected transcripts in a differentiation stage-specific manner, supporting the hypothesis that alternative splicing-coupled nonsense-mediated decay contributes to regulation of erythroid-expressed genes as a novel part of the overall differentiation program. We conclude that a highly dynamic alternative splicing program in terminally differentiating erythroblasts plays a major role in regulating gene expression to ensure synthesis of appropriate proteome at each stage as the cells remodel in preparation for production of mature red cells.",
        "doi": "10.1093/nar/gkt1388",
        "pmcid": "PMC3973340",
        "issn": "0305-1048",
        "publisher": "Oxford University Press",
        "publication": "Nucleic Acids Research",
        "publication_date": "2014-04",
        "series_number": "6",
        "volume": "42",
        "issue": "6",
        "pages": "4031-4042"
    },
    {
        "id": "authors:v52pv-abs78",
        "collection": "authors",
        "collection_id": "v52pv-abs78",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170303-144424899",
        "type": "article",
        "title": "Fragment assignment in the cloud with eXpress-D",
        "author": [
            {
                "family_name": "Roberts",
                "given_name": "Adam",
                "clpid": "Roberts-A"
            },
            {
                "family_name": "Feng",
                "given_name": "Harvey",
                "clpid": "Feng-Harvey"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "Background: Probabilistic assignment of ambiguously mapped fragments produced by high-throughput sequencing experiments has been demonstrated to greatly improve accuracy in the analysis of RNA-Seq and ChIP-Seq, and is an essential step in many other sequence census experiments. A maximum likelihood method using the expectation-maximization (EM) algorithm for optimization is commonly used to solve this problem. However, batch EM-based approaches do not scale well with the size of sequencing datasets, which have been increasing dramatically over the past few years. Thus, current approaches to fragment assignment rely on heuristics or approximations for tractability. \n\nResults: We present an implementation of a distributed EM solution to the fragment assignment problem using Spark, a data analytics framework that can scale by leveraging compute clusters within datacenters\u2013\"the cloud\". We demonstrate that our implementation easily scales to billions of sequenced fragments, while providing the exact maximum likelihood assignment of ambiguous fragments. The accuracy of the method is shown to be an improvement over the most widely used tools available and can be run in a constant amount of time when cluster resources are scaled linearly with the amount of input data. \n\nConclusions: The cloud offers one solution for the difficulties faced in the analysis of massive high-thoughput sequencing data, which continue to grow rapidly. Researchers in bioinformatics must follow developments in distributed systems\u2013such as new frameworks like Spark\u2013for ways to port existing methods to the cloud and help them scale to the datasets of the future. Our software, eXpress-D, is freely available at: http://github.com/adarob/express-d.",
        "doi": "10.1186/1471-2105-14-358",
        "pmcid": "PMC3881492",
        "issn": "1471-2105",
        "publisher": "BioMed Central",
        "publication": "BMC Bioinformatics",
        "publication_date": "2013-12-07",
        "volume": "14",
        "pages": "Art. No. 358"
    },
    {
        "id": "authors:55r0e-gyg95",
        "collection": "authors",
        "collection_id": "55r0e-gyg95",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170303-154642805",
        "type": "article",
        "title": "Updating RNA-Seq analyses after re-annotation",
        "author": [
            {
                "family_name": "Roberts",
                "given_name": "Adam",
                "clpid": "Roberts-A"
            },
            {
                "family_name": "Schaeffer",
                "given_name": "Lorian",
                "clpid": "Schaeffer-L-V"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "The estimation of isoform abundances from RNA-Seq data requires a time-intensive step of mapping reads to either an assembled or previously annotated transcriptome, followed by an optimization procedure for deconvolution of multi-mapping reads. These procedures are essential for downstream analysis such as differential expression. In cases where it is desirable to adjust the underlying annotation, for example, on the discovery of novel isoforms or errors in existing annotations, current pipelines must be rerun from scratch. This makes it difficult to update abundance estimates after re-annotation, or to explore the effect of changes in the transcriptome on analyses. We present a novel efficient algorithm for updating abundance estimates from RNA-Seq experiments on re-annotation that does not require re-analysis of the entire dataset. Our approach is based on a fast partitioning algorithm for identifying transcripts whose abundances may depend on the added or deleted isoforms, and on a fast follow-up approach to re-estimating abundances for all transcripts. We demonstrate the effectiveness of our methods by showing how to synchronize RNA-Seq abundance estimates with the daily RefSeq incremental updates. Thus, we provide a practical approach to maintaining relevant databases of RNA-Seq derived abundance estimates even as annotations are being constantly revised.",
        "doi": "10.1093/bioinformatics/btt197",
        "pmcid": "PMC3694665",
        "issn": "1367-4803",
        "publisher": "Oxford University Press",
        "publication": "Bioinformatics",
        "publication_date": "2013-07-01",
        "series_number": "13",
        "volume": "29",
        "issue": "13",
        "pages": "1631-1637"
    },
    {
        "id": "authors:h7bt6-gaq50",
        "collection": "authors",
        "collection_id": "h7bt6-gaq50",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170303-162557287",
        "type": "article",
        "title": "Affine and Projective Tree Metric Theorems",
        "author": [
            {
                "family_name": "Kleinman",
                "given_name": "Aaron",
                "clpid": "Kleinman-A"
            },
            {
                "family_name": "Harel",
                "given_name": "Matan",
                "clpid": "Harel-M"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "The tree metric theorem provides a combinatorial four-point condition that characterizes dissimilarity maps derived from pairwise compatible split systems. A related weaker four point condition characterizes dissimilarity maps derived from circular split systems known as Kalmanson metrics. The tree metric theorem was first discovered in the context of phylogenetics and forms the basis of many tree reconstruction algorithms, whereas Kalmanson metrics were first considered by computer scientists, and are notable in that they are a non-trivial class of metrics for which the traveling salesman problem is tractable. We present a unifying framework for these theorems based on combinatorial structures that are used for graph planarity testing. These are (projective) PC-trees, and their affine analogs, PQ-trees. In the projective case, we generalize a number of concepts from clustering theory, including hierarchies, pyramids, ultrametrics, and Robinsonian matrices, and the theorems that relate them. As with tree metrics and ultrametrics, the link between PC-trees and PQ-trees is established via the Gromov product.",
        "doi": "10.1007/s00026-012-0173-2",
        "issn": "0218-0006",
        "publisher": "Springer",
        "publication": "Annals of Combinatorics",
        "publication_date": "2013-03",
        "series_number": "1",
        "volume": "17",
        "issue": "1",
        "pages": "205-228"
    },
    {
        "id": "authors:x5dqz-e3w34",
        "collection": "authors",
        "collection_id": "x5dqz-e3w34",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170303-155431520",
        "type": "article",
        "title": "CGAL: computing genome assembly likelihoods",
        "author": [
            {
                "family_name": "Rahman",
                "given_name": "Atif",
                "orcid": "0000-0003-1805-3971",
                "clpid": "Rahman-A"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "Assembly algorithms have been extensively benchmarked using simulated data so that results can be compared to ground truth. However, in de novo assembly, only crude metrics such as contig number and size are typically used to evaluate assembly quality. We present CGAL, a novel likelihood-based approach to assembly assessment in the absence of a ground truth. We show that likelihood is more accurate than other metrics currently used for evaluating assemblies, and describe its application to the optimization and comparison of assembly algorithms. Our methods are implemented in software that is freely available at http://bio.math.berkeley.edu/cgal/.",
        "doi": "10.1186/gb-2013-14-1-r8",
        "pmcid": "PMC3663106",
        "issn": "1465-6906",
        "publisher": "BioMed Central",
        "publication": "Genome Biology",
        "publication_date": "2013-01-29",
        "series_number": "1",
        "volume": "14",
        "issue": "1",
        "pages": "Art. No. R8"
    },
    {
        "id": "authors:3jby4-nxg71",
        "collection": "authors",
        "collection_id": "3jby4-nxg71",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170303-161532491",
        "type": "article",
        "title": "Differential analysis of gene regulation at transcript resolution with RNA-seq",
        "author": [
            {
                "family_name": "Trapnell",
                "given_name": "Cole",
                "clpid": "Trapnell-C"
            },
            {
                "family_name": "Hendrickson",
                "given_name": "David G.",
                "clpid": "Hendrickson-D-G"
            },
            {
                "family_name": "Sauvageau",
                "given_name": "Martin",
                "clpid": "Sauvageau-M"
            },
            {
                "family_name": "Goff",
                "given_name": "Loyal",
                "clpid": "Goff-L-A"
            },
            {
                "family_name": "Rinn",
                "given_name": "John L.",
                "clpid": "Rinn-J-L"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "Differential analysis of gene and transcript expression using high-throughput RNA sequencing (RNA-seq) is complicated by several sources of measurement variability and poses numerous statistical challenges. We present Cuffdiff 2, an algorithm that estimates expression at transcript-level resolution and controls for variability evident across replicate libraries. Cuffdiff 2 robustly identifies differentially expressed transcripts and genes and reveals differential splicing and promoter-preference changes. We demonstrate the accuracy of our approach through differential analysis of lung fibroblasts in response to loss of the developmental transcription factor HOXA1, which we show is required for lung fibroblast and HeLa cell cycle progression. Loss of HOXA1 results in significant expression level changes in thousands of individual transcripts, along with isoform switching events in key regulators of the cell cycle. Cuffdiff 2 performs robust differential analysis in RNA-seq experiments at transcript resolution, revealing a layer of regulation not readily observable with other high-throughput technologies.",
        "doi": "10.1038/nbt.2450",
        "pmcid": "PMC3869392",
        "issn": "1087-0156",
        "publisher": "Nature Publishing Group",
        "publication": "Nature Biotechnology",
        "publication_date": "2013-01",
        "series_number": "1",
        "volume": "31",
        "issue": "1",
        "pages": "46-53"
    },
    {
        "id": "authors:fa8tb-0xm37",
        "collection": "authors",
        "collection_id": "fa8tb-0xm37",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170303-163300268",
        "type": "article",
        "title": "Streaming fragment assignment for real-time analysis of sequencing experiments",
        "author": [
            {
                "family_name": "Roberts",
                "given_name": "Adam",
                "clpid": "Roberts-Adam"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "We present eXpress, a software package for efficient probabilistic assignment of ambiguously mapping sequenced fragments. eXpress uses a streaming algorithm with linear run time and constant memory use. It can determine abundances of sequenced molecules in real time and can be applied to ChIP-seq, metagenomics and other large-scale sequencing data. We demonstrate its use on RNA-seq data and show that eXpress achieves greater efficiency than other quantification methods.",
        "doi": "10.1038/nmeth.2251",
        "pmcid": "PMC3880119",
        "issn": "1548-7091",
        "publisher": "Nature Publishing Group",
        "publication": "Nature Methods",
        "publication_date": "2013-01",
        "series_number": "1",
        "volume": "10",
        "issue": "1",
        "pages": "71-73"
    },
    {
        "id": "authors:eas5h-yyq14",
        "collection": "authors",
        "collection_id": "eas5h-yyq14",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170303-164056261",
        "type": "article",
        "title": "Quantifying uniformity of mapped reads",
        "author": [
            {
                "family_name": "Hower",
                "given_name": "Valerie",
                "clpid": "Hower-V"
            },
            {
                "family_name": "Starfield",
                "given_name": "Richard",
                "clpid": "Starfield-R"
            },
            {
                "family_name": "Roberts",
                "given_name": "Adam",
                "clpid": "Roberts-Adam"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "We describe a tool for quantifying the uniformity of mapped reads in high-throughput sequencing experiments. Our statistic directly measures the uniformity of both read position and fragment length, and we explain how to compute a P-value that can be used to quantify biases arising from experimental protocols and mapping procedures. Our method is useful for comparing different protocols in experiments such as RNA-Seq. \n\nAvailability and implementation: We provide a freely available and open source python script that can be used to analyze raw read data or reads mapped to transcripts in BAM format at http://www.math.miami.edu/~vhower/ReadSpy.html",
        "doi": "10.1093/bioinformatics/bts451",
        "pmcid": "PMC3467739",
        "issn": "1367-4803",
        "publisher": "Oxford University Press",
        "publication": "Bioinformatics",
        "publication_date": "2012-10-15",
        "series_number": "20",
        "volume": "28",
        "issue": "20",
        "pages": "2680-2682"
    },
    {
        "id": "authors:x492k-9n169",
        "collection": "authors",
        "collection_id": "x492k-9n169",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170303-165006599",
        "type": "article",
        "title": "Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks",
        "author": [
            {
                "family_name": "Trapnell",
                "given_name": "Cole",
                "clpid": "Trapnell-C"
            },
            {
                "family_name": "Roberts",
                "given_name": "Adam",
                "clpid": "Roberts-Adam"
            },
            {
                "family_name": "Goff",
                "given_name": "Loyal",
                "clpid": "Goff-L-A"
            },
            {
                "family_name": "Pertea",
                "given_name": "Geo",
                "clpid": "Pertea-G"
            },
            {
                "family_name": "Kim",
                "given_name": "Daehwan",
                "clpid": "Kim-Daehwan"
            },
            {
                "family_name": "Kelley",
                "given_name": "David R.",
                "clpid": "Kelley-D-R"
            },
            {
                "family_name": "Pimentel",
                "given_name": "Harold",
                "clpid": "Pimentel-H"
            },
            {
                "family_name": "Salzberg",
                "given_name": "Steven L.",
                "clpid": "Salzberg-S-L"
            },
            {
                "family_name": "Rinn",
                "given_name": "John L.",
                "clpid": "Rinn-J-L"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "Recent advances in high-throughput cDNA sequencing (RNA-seq) can reveal new genes and splice variants and quantify expression genome-wide in a single assay. The volume and complexity of data from RNA-seq experiments necessitate scalable, fast and mathematically principled analysis software. TopHat and Cufflinks are free, open-source software tools for gene discovery and comprehensive expression analysis of high-throughput mRNA sequencing (RNA-seq) data. Together, they allow biologists to identify new genes and new splice variants of known ones, as well as compare gene and transcript expression under two or more conditions. This protocol describes in detail how to use TopHat and Cufflinks to perform such analyses. It also covers several accessory tools and utilities that aid in managing data, including CummeRbund, a tool for visualizing RNA-seq analysis results. Although the procedure assumes basic informatics skills, these tools assume little to no background with RNA-seq analysis and are meant for novices and experts alike. The protocol begins with raw sequencing reads and produces a transcriptome assembly, lists of differentially expressed and regulated genes and transcripts, and publication-quality visualizations of analysis results. The protocol's execution time depends on the volume of transcriptome sequencing data and available computing resources but takes less than 1 d of computer time for typical experiments and ~1 h of hands-on time.",
        "doi": "10.1038/nprot.2012.016",
        "pmcid": "PMC3334321",
        "issn": "1754-2189",
        "publisher": "Nature Publishing Group",
        "publication": "Nature Protocols",
        "publication_date": "2012-03",
        "series_number": "3",
        "volume": "7",
        "issue": "3",
        "pages": "562-578"
    },
    {
        "id": "authors:dnfee-gqw75",
        "collection": "authors",
        "collection_id": "dnfee-gqw75",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170303-164721967",
        "type": "article",
        "title": "A closer look at RNA editing",
        "author": [
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "Recent advances in high-throughput sequencing technology have made it possible to study RNA editing on a genome-wide scale. But realizing the potential of this approach requires stringent data analysis methods that control for genomic variation, sequencing errors and biases introduced by read-mapping procedures. In this issue, Peng et al. introduce such methods and apply them to conduct a careful, large-scale study of RNA editing in the transcriptome of a Han Chinese individual. These data provide the first reliable map of RNA edits in a person.",
        "doi": "10.1038/nbt.2156",
        "issn": "1087-0156",
        "publisher": "Nature Publishing Group",
        "publication": "Nature Biotechnology",
        "publication_date": "2012-03",
        "series_number": "3",
        "volume": "30",
        "issue": "3",
        "pages": "246-247"
    },
    {
        "id": "authors:1bb4r-e4p39",
        "collection": "authors",
        "collection_id": "1bb4r-e4p39",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170306-101015596",
        "type": "article",
        "title": "Phyloepigenomic comparison of great apes reveals a correlation between somatic and germline methylation states",
        "author": [
            {
                "family_name": "Martin",
                "given_name": "David I. K.",
                "clpid": "Martin-David-I-K"
            },
            {
                "family_name": "Singer",
                "given_name": "Meromit",
                "clpid": "Singer-M"
            },
            {
                "family_name": "Dhahbi",
                "given_name": "Joseph",
                "clpid": "Dhahbi-Joseph"
            },
            {
                "family_name": "Mao",
                "given_name": "Guanxiong",
                "clpid": "Mao-Guanxiong"
            },
            {
                "family_name": "Zhang",
                "given_name": "Lu",
                "clpid": "Zhang-Lu"
            },
            {
                "family_name": "Schroth",
                "given_name": "Gary P.",
                "clpid": "Schroth-G-P"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "family_name": "Boffelli",
                "given_name": "Dario",
                "clpid": "Boffelli-D"
            }
        ],
        "abstract": "We have determined methylation state differences in the epigenomes of uncultured cells purified from human, chimpanzee, and orangutan, using digestion with a methylation-sensitive enzyme, deep sequencing, and computational analysis of the sequence data. The methylomes show a high degree of conservation, but the methylation states of approximately 10% of CpG island-like regions differ significantly between human and chimp. The differences are not associated with changes in CG content, and recapitulate the known phylogenetic relationship of the three species, indicating that they are stably maintained within each species. Inferences about the relationship between somatic and germline methylation states can be made by an analysis of CG decay, derived from methylation and sequence data. This indicates that somatic methylation states are highly related to germline states, and that the methylation differences between human and chimp have occurred in the germline. These results provide evidence for epigenetic changes that occur in the germline and distinguish closely related species, and suggest that germline epigenetic states might constrain somatic states.",
        "doi": "10.1101/gr.122721.111",
        "pmcid": "PMC3227095",
        "issn": "1088-9051",
        "publisher": "Cold Spring Harbor Laboratory Press",
        "publication": "Genome Research",
        "publication_date": "2011-12-21",
        "series_number": "12",
        "volume": "21",
        "issue": "12",
        "pages": "2049-2057"
    },
    {
        "id": "authors:fbtq8-9m762",
        "collection": "authors",
        "collection_id": "fbtq8-9m762",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170306-093511106",
        "type": "article",
        "title": "RNA-Seq and find: entering the RNA deep field",
        "author": [
            {
                "family_name": "Roberts",
                "given_name": "Adam",
                "clpid": "Roberts-Adam"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "Initial high-throughput RNA sequencing (RNA-Seq) experiments have revealed a complex and dynamic transcriptome, but because it samples transcripts in proportion to their abundances, assessing the extent and nature of low-level transcription using this technique has been difficult. A new assay, RNA CaptureSeq, addresses this limitation of RNA-Seq by enriching for low-level transcripts with cDNA tiling arrays prior to high-throughput sequencing. This approach reveals a plethora of transcripts that have been previously dismissed as 'noise', and hints at single-cell transcription fingerprints that may be crucial in defining cellular function in normal and disease states.",
        "doi": "10.1186/gm290",
        "pmcid": "PMC3308029",
        "issn": "1756-994X",
        "publisher": "BioMed Central",
        "publication": "Genome Medicine",
        "publication_date": "2011-11-22",
        "series_number": "11",
        "volume": "3",
        "issue": "11",
        "pages": "Art. No. 74"
    },
    {
        "id": "authors:khkgx-4cj35",
        "collection": "authors",
        "collection_id": "khkgx-4cj35",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170306-095020310",
        "type": "article",
        "title": "Identification and correction of systematic error in high-throughput sequence data",
        "author": [
            {
                "family_name": "Meacham",
                "given_name": "Frazer",
                "clpid": "Meacham-F"
            },
            {
                "family_name": "Boffelli",
                "given_name": "Dario",
                "clpid": "Boffelli-D"
            },
            {
                "family_name": "Dhahbi",
                "given_name": "Joseph",
                "clpid": "Dhahbi-Joseph"
            },
            {
                "family_name": "Martin",
                "given_name": "David I. K.",
                "clpid": "Martin-David-I-K"
            },
            {
                "family_name": "Singer",
                "given_name": "Meromit",
                "clpid": "Singer-Meromit"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "Background: A feature common to all DNA sequencing technologies is the presence of base-call errors in the sequenced reads. The implications of such errors are application specific, ranging from minor informatics nuisances to major problems affecting biological inferences. Recently developed \"next-gen\" sequencing technologies have greatly reduced the cost of sequencing, but have been shown to be more error prone than previous technologies. Both position specific (depending on the location in the read) and sequence specific (depending on the sequence in the read) errors have been identified in Illumina and Life Technology sequencing platforms. We describe a new type of systematic error that manifests as statistically unlikely accumulations of errors at specific genome (or transcriptome) locations.\n\nResults: We characterize and describe systematic errors using overlapping paired reads from high-coverage data. We show that such errors occur in approximately 1 in 1000 base pairs, and that they are highly replicable across experiments. We identify motifs that are frequent at systematic error sites, and describe a classifier that distinguishes heterozygous sites from systematic error. Our classifier is designed to accommodate data from experiments in which the allele frequencies at heterozygous sites are not necessarily 0.5 (such as in the case of RNA-Seq), and can be used with single-end datasets. \n\nConclusions: Systematic errors can easily be mistaken for heterozygous sites in individuals, or for SNPs in population analyses. Systematic errors are particularly problematic in low coverage experiments, or in estimates of allele-specific expression from RNA-Seq data. Our characterization of systematic error has allowed us to develop a program, called SysCall, for identifying and correcting such errors. 
We conclude that correction of systematic errors is important to consider in the design and interpretation of high-throughput sequencing experiments.",
        "doi": "10.1186/1471-2105-12-451",
        "pmcid": "PMC3295828",
        "issn": "1471-2105",
        "publisher": "BioMed Central",
        "publication": "BMC Bioinformatics",
        "publication_date": "2011-11-21",
        "volume": "12",
        "pages": "Art. No. 451"
    },
    {
        "id": "authors:4dgap-m4a79",
        "collection": "authors",
        "collection_id": "4dgap-m4a79",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170306-100428893",
        "type": "article",
        "title": "Determining Coding CpG Islands by Identifying Regions Significant for Pattern Statistics on Markov Chains",
        "author": [
            {
                "family_name": "Singer",
                "given_name": "Meromit",
                "clpid": "Singer-Meromit"
            },
            {
                "family_name": "Engstr\u00f6m",
                "given_name": "Alexander",
                "clpid": "Engstr\u00f6m-A"
            },
            {
                "family_name": "Sch\u00f6nhuth",
                "given_name": "Alexander",
                "clpid": "Sch\u00f6nhuth-A"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "Recent experimental and computational work confirms that CpGs can be unmethylated inside coding exons, thereby showing that codons may be subjected to both genomic and epigenomic constraint. It is therefore of interest to identify coding CpG islands (CCGIs) that are regions inside exons enriched for CpGs. The difficulty in identifying such islands is that coding exons exhibit sequence biases determined by codon usage and constraints that must be taken into account. \n\nWe present a method for finding CCGIs that showcases a novel approach we have developed for identifying regions of interest that are significant (with respect to a Markov chain) for the counts of any pattern. Our method begins with the exact computation of tail probabilities for the number of CpGs in all regions contained in coding exons, and then applies a greedy algorithm for selecting islands from among the regions. We show that the greedy algorithm provably optimizes a biologically motivated criterion for selecting islands while controlling the false discovery rate. \n\nWe applied this approach to the human genome (hg18) and annotated CpG islands in coding exons. The statistical criterion we apply to evaluating islands reduces the number of false positives in existing annotations, while our approach to defining islands reveals significant numbers of undiscovered CCGIs in coding exons. Many of these appear to be examples of functional epigenetic specialization in coding exons.",
        "doi": "10.2202/1544-6115.1677",
        "issn": "2194-6302",
        "publisher": "De Gruyter",
        "publication": "Statistical Applications in Genetics and Molecular Biology",
        "publication_date": "2011-09-23",
        "series_number": "1",
        "volume": "10",
        "issue": "1",
        "pages": "Art. No. 43"
    },
    {
        "id": "authors:381cc-ttw35",
        "collection": "authors",
        "collection_id": "381cc-ttw35",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170306-101954304",
        "type": "article",
        "title": "Identification of novel transcripts in annotated genomes using RNA-Seq",
        "author": [
            {
                "family_name": "Roberts",
                "given_name": "Adam",
                "clpid": "Roberts-A"
            },
            {
                "family_name": "Pimentel",
                "given_name": "Harold",
                "clpid": "Pimentel-H"
            },
            {
                "family_name": "Trapnell",
                "given_name": "Cole",
                "clpid": "Trapnell-C"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "Summary: We describe a new 'reference annotation based transcript assembly' problem for RNA-Seq data that involves assembling novel transcripts in the context of an existing annotation. This problem arises in the analysis of expression in model organisms, where it is desirable to leverage existing annotations for discovering novel transcripts. We present an algorithm for reference annotation-based transcript assembly and show how it can be used to rapidly investigate novel transcripts revealed by RNA-Seq in comparison with a reference annotation. \n\nAvailability: The methods described in this article are implemented in the Cufflinks suite of software for RNA-Seq, freely available from http://bio.math.berkeley.edu/cufflinks. The software is released under the BOOST license.",
        "doi": "10.1093/bioinformatics/btr355",
        "issn": "1367-4803",
        "publisher": "Oxford University Press",
        "publication": "Bioinformatics",
        "publication_date": "2011-09-01",
        "series_number": "17",
        "volume": "27",
        "issue": "17",
        "pages": "2325-2329"
    },
    {
        "id": "authors:kg7s1-59c84",
        "collection": "authors",
        "collection_id": "kg7s1-59c84",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170306-102549258",
        "type": "article",
        "title": "Tracing the Most Parsimonious Indel History",
        "author": [
            {
                "family_name": "Snir",
                "given_name": "Sagi",
                "clpid": "Snir-S"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "Sequence alignment (the grouping of homologous bases into one column) is fundamental to almost any task in comparative genomics. This translates to positing gaps in the genomic sequences to account for events of insertions and deletions (indels). The interrelationship between sequence alignment and phylogenetic reconstruction has drawn substantial attention recently with works showing the significance of differences in alignments. One of the plausible approaches in this direction is to grade the suitability of a tree to an associated alignment and vice verse. We here present a combinatorial (as opposed to statistical) approach based on the indel history. We show\u2014both by simulations and by using real biological data from the Encyclopedia of DNA Elements (ENCODE)\u2014that this criterion is sound. The novelty of our approach is the distinguishing between insertions and deletions, and augmenting the analysis with a dimension of \"depth,\" extending it from the sequence space to the phylogenetic space. Using this approach, we perform a comprehensive study of indel characteristic behavior among mammals in both coding and non-coding regions. Our results show significant differences in indel patterns between coding and non-coding regions. We also show other characteristic patterns of indel evolution in the depth of the underlying phylogeny.",
        "doi": "10.1089/cmb.2010.0325",
        "issn": "1066-5277",
        "publisher": "Mary Ann Liebert, Inc.",
        "publication": "Journal of Computational Biology",
        "publication_date": "2011-08",
        "series_number": "8",
        "volume": "18",
        "issue": "8",
        "pages": "967-986"
    },
    {
        "id": "authors:t7zmb-bhs07",
        "collection": "authors",
        "collection_id": "t7zmb-bhs07",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170306-113043756",
        "type": "article",
        "title": "The neighbor-net algorithm",
        "author": [
            {
                "family_name": "Levy",
                "given_name": "Dan",
                "clpid": "Levy-D"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "The neighbor-joining algorithm is a popular phylogenetics method for constructing trees from dissimilarity maps. The neighbor-net algorithm is an extension of the neighbor-joining algorithm and is used for constructing split networks. We begin by describing the output of neighbor-net in terms of the tessellation of M\u00af_0^n(R) by associahedra. This highlights the fact that neighbor-net outputs a tree in addition to a circular ordering and we explain when the neighbor-net tree is the neighbor-joining tree. A key observation is that the tree constructed in existing implementations of neighbor-net is not a neighbor-joining tree. Next, we show that neighbor-net is a greedy algorithm for finding circular split systems of minimal balanced length. This leads to an interpretation of neighbor-net as a greedy algorithm for the traveling salesman problem. The algorithm is optimal for Kalmanson matrices, from which it follows that neighbor-net is consistent and has optimal radius 12. We also provide a statistical interpretation for the balanced length for a circular split system as the length based on weighted least squares estimates of the splits. We conclude with applications of these results and demonstrate the implications of our theorems for a recently published comparison of Papuan and Austronesian languages.",
        "doi": "10.1016/j.aam.2010.09.002",
        "issn": "0196-8858",
        "publisher": "Elsevier",
        "publication": "Advances in Applied Mathematics",
        "publication_date": "2011-08",
        "series_number": "2",
        "volume": "47",
        "issue": "2",
        "pages": "240-258"
    },
    {
        "id": "authors:8ca3h-68a74",
        "collection": "authors",
        "collection_id": "8ca3h-68a74",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170306-104204159",
        "type": "article",
        "title": "Multiplexed RNA structure characterization with selective 2'-hydroxyl acylation analyzed by primer extension sequencing (SHAPE-Seq)",
        "author": [
            {
                "family_name": "Lucks",
                "given_name": "Julius B.",
                "clpid": "Lucks-J-B"
            },
            {
                "family_name": "Mortimer",
                "given_name": "Stefanie A.",
                "clpid": "Mortimer-S-A"
            },
            {
                "family_name": "Trapnell",
                "given_name": "Cole",
                "clpid": "Trapnell-C"
            },
            {
                "family_name": "Luo",
                "given_name": "Shujun",
                "clpid": "Luo-Shujun"
            },
            {
                "family_name": "Aviran",
                "given_name": "Sharon",
                "clpid": "Aviran-S"
            },
            {
                "family_name": "Schroth",
                "given_name": "Gary P.",
                "clpid": "Schroth-G-P"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "family_name": "Doudna",
                "given_name": "Jennifer A.",
                "clpid": "Doudna-J-A"
            },
            {
                "family_name": "Arkin",
                "given_name": "Adam P.",
                "clpid": "Arkin-A-P"
            }
        ],
        "abstract": "New regulatory roles continue to emerge for both natural and engineered noncoding RNAs, many of which have specific secondary and tertiary structures essential to their function. Thus there is a growing need to develop technologies that enable rapid characterization of structural features within complex RNA populations. We have developed a high-throughput technique, SHAPE-Seq, that can simultaneously measure quantitative, single nucleotide-resolution secondary and tertiary structural information for hundreds of RNA molecules of arbitrary sequence. SHAPE-Seq combines selective 2\u2032-hydroxyl acylation analyzed by primer extension (SHAPE) chemistry with multiplexed paired-end deep sequencing of primer extension products. This generates millions of sequencing reads, which are then analyzed using a fully automated data analysis pipeline, based on a rigorous maximum likelihood model of the SHAPE-Seq experiment. We demonstrate the ability of SHAPE-Seq to accurately infer secondary and tertiary structural information, detect subtle conformational changes due to single nucleotide point mutations, and simultaneously measure the structures of a complex pool of different RNA molecules. SHAPE-Seq thus represents a powerful step toward making the study of RNA secondary and tertiary structures high throughput and accessible to a wide array of scientific pursuits, from fundamental biological investigations to engineering RNA for synthetic biological systems.",
        "doi": "10.1073/pnas.1106501108",
        "pmcid": "PMC3131332",
        "issn": "0027-8424",
        "publisher": "National Academy of Sciences",
        "publication": "Proceedings of the National Academy of Sciences of the United States of America",
        "publication_date": "2011-07-05",
        "series_number": "27",
        "volume": "108",
        "issue": "27",
        "pages": "11063-11068"
    },
    {
        "id": "authors:c0y2x-zwa90",
        "collection": "authors",
        "collection_id": "c0y2x-zwa90",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170306-103050792",
        "type": "article",
        "title": "Modeling and automation of sequencing-based characterization of RNA structure",
        "author": [
            {
                "family_name": "Aviran",
                "given_name": "Sharon",
                "clpid": "Aviran-S"
            },
            {
                "family_name": "Trapnell",
                "given_name": "Cole",
                "clpid": "Trapnell-C"
            },
            {
                "family_name": "Lucks",
                "given_name": "Julius B.",
                "clpid": "Lucks-J-B"
            },
            {
                "family_name": "Mortimer",
                "given_name": "Stefanie A.",
                "clpid": "Mortimer-S-A"
            },
            {
                "family_name": "Luo",
                "given_name": "Shujun",
                "clpid": "Luo-Shujun"
            },
            {
                "family_name": "Schroth",
                "given_name": "Gary P.",
                "clpid": "Schroth-G-P"
            },
            {
                "family_name": "Doudna",
                "given_name": "Jennifer A.",
                "clpid": "Doudna-J-A"
            },
            {
                "family_name": "Arkin",
                "given_name": "Adam P.",
                "clpid": "Arkin-A-P"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "Sequence census methods reduce molecular measurements such as transcript abundance and protein-nucleic acid interactions to counting problems via DNA sequencing. We focus on a novel assay utilizing this approach, called selective 2\u2032-hydroxyl acylation analyzed by primer extension sequencing (SHAPE-Seq), that can be used to characterize RNA secondary and tertiary structure. We describe a fully automated data analysis pipeline for SHAPE-Seq analysis that includes read processing, mapping, and structural inference based on a model of the experiment. Our methods rely on the solution of a series of convex optimization problems for which we develop efficient and effective numerical algorithms. Our results can be easily extended to other chemical probes of RNA structure, and also generalized to modeling polymerase drop-off in other sequence census-based experiments.",
        "doi": "10.1073/pnas.1106541108",
        "pmcid": "PMC3131376",
        "issn": "0027-8424",
        "publisher": "National Academy of Sciences",
        "publication": "Proceedings of the National Academy of Sciences of the United States of America",
        "publication_date": "2011-07-05",
        "series_number": "27",
        "volume": "108",
        "issue": "27",
        "pages": "11069-11074"
    },
    {
        "id": "authors:tkpa3-pje63",
        "collection": "authors",
        "collection_id": "tkpa3-pje63",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170306-105110860",
        "type": "article",
        "title": "Improving RNA-Seq expression estimates by correcting for fragment bias",
        "author": [
            {
                "family_name": "Roberts",
                "given_name": "Adam",
                "clpid": "Roberts-A"
            },
            {
                "family_name": "Trapnell",
                "given_name": "Cole",
                "clpid": "Trapnell-C"
            },
            {
                "family_name": "Donaghey",
                "given_name": "Julie",
                "clpid": "Donaghey-J"
            },
            {
                "family_name": "Rinn",
                "given_name": "John L.",
                "clpid": "Rinn-J-L"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "The biochemistry of RNA-Seq library preparation results in cDNA fragments that are not uniformly distributed within the transcripts they represent. This non-uniformity must be accounted for when estimating expression levels, and we show how to perform the needed corrections using a likelihood based approach. We find improvements in expression estimates as measured by correlation with independently performed qRT-PCR and show that correction of bias leads to improved replicability of results across libraries and sequencing technologies.",
        "doi": "10.1186/gb-2011-12-3-r22",
        "pmcid": "PMC3129672",
        "issn": "1465-6906",
        "publisher": "BioMed Central",
        "publication": "Genome Biology",
        "publication_date": "2011-03-16",
        "series_number": "3",
        "volume": "12",
        "issue": "3",
        "pages": "Art. No. R22"
    },
    {
        "id": "authors:3k62z-3jg56",
        "collection": "authors",
        "collection_id": "3k62z-3jg56",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170306-111327579",
        "type": "article",
        "title": "Shape-based peak identification for ChIP-Seq",
        "author": [
            {
                "family_name": "Hower",
                "given_name": "Valerie",
                "clpid": "Hower-V"
            },
            {
                "family_name": "Evans",
                "given_name": "Steven N.",
                "clpid": "Evans-S-N"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "Background: The identification of binding targets for proteins using ChIP-Seq has gained popularity as an alternative to ChIP-chip. Sequencing can, in principle, eliminate artifacts associated with microarrays, and cheap sequencing offers the ability to sequence deeply and obtain a comprehensive survey of binding. A number of algorithms have been developed to call \"peaks\" representing bound regions from mapped reads. Most current algorithms incorporate multiple heuristics, and despite much work it remains difficult to accurately determine individual peaks corresponding to distinct binding events. \n\nResults: Our method for identifying statistically significant peaks from read coverage is inspired by the notion of persistence in topological data analysis and provides a non-parametric approach that is statistically sound and robust to noise in experiments. Specifically, our method reduces the peak calling problem to the study of tree-based statistics derived from the data. We validate our approach using previously published data and show that it can discover previously missed regions. \n\nConclusions: The difficulty in accurately calling peaks for ChIP-Seq data is partly due to the difficulty in defining peaks, and we demonstrate a novel method that improves on the accuracy of previous methods in resolving peaks. Our introduction of a robust statistical test based on ideas from topological data analysis is also novel. Our methods are implemented in a program called T-PIC (T ree shape P eak I dentification for C hIP-Seq) is available at http://bio.math.berkeley.edu/tpic/.",
        "doi": "10.1186/1471-2105-12-15",
        "pmcid": "PMC3032669",
        "issn": "1471-2105",
        "publisher": "BioMed Central",
        "publication": "BMC Bioinformatics",
        "publication_date": "2011-01-12",
        "volume": "12",
        "pages": "Art. No. 15"
    },
    {
        "id": "authors:445ny-3rj08",
        "collection": "authors",
        "collection_id": "445ny-3rj08",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170306-114844988",
        "type": "article",
        "title": "Development of a Low Bias Method for Characterizing Viral Populations Using Next Generation Sequencing Technology",
        "author": [
            {
                "family_name": "Willerth",
                "given_name": "Stephanie M.",
                "clpid": "Willerth-S-M"
            },
            {
                "family_name": "Pedro",
                "given_name": "H\u00e9lder A. M.",
                "clpid": "Pedro-H-A-M"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "family_name": "Humeau",
                "given_name": "Laurent M.",
                "clpid": "Humeau-L-M"
            },
            {
                "family_name": "Arkin",
                "given_name": "Adam P.",
                "clpid": "Arkin-A-P"
            },
            {
                "family_name": "Schaffer",
                "given_name": "David V.",
                "clpid": "Schaffer-D-V"
            }
        ],
        "abstract": "Background: With an estimated 38 million people worldwide currently infected with human immunodeficiency virus (HIV), and an additional 4.1 million people becoming infected each year, it is important to understand how this virus mutates and develops resistance in order to design successful therapies. \n\nMethodology/Principal Findings: We report a novel experimental method for amplifying full-length HIV genomes without the use of sequence-specific primers for high throughput DNA sequencing, followed by assembly of full length viral genome sequences from the resulting large dataset. Illumina was chosen for sequencing due to its ability to provide greater coverage of the HIV genome compared to prior methods, allowing for more comprehensive characterization of the heterogeneity present in the HIV samples analyzed. Our novel amplification method in combination with Illumina sequencing was used to analyze two HIV populations: a homogenous HIV population based on the canonical NL4-3 strain and a heterogeneous viral population obtained from a HIV patient's infected T cells. In addition, the resulting sequence was analyzed using a new computational approach to obtain a consensus sequence and several metrics of diversity. \n\nSignificance: This study demonstrates how a lower bias amplification method in combination with next generation DNA sequencing provides in-depth, complete coverage of the HIV genome, enabling a stronger characterization of the quasispecies present in a clinically relevant HIV population as well as future study of how HIV mutates in response to a selective pressure.",
        "doi": "10.1371/journal.pone.0013564",
        "pmcid": "PMC2962647",
        "issn": "1932-6203",
        "publisher": "Public Library of Science",
        "publication": "PLOS ONE",
        "publication_date": "2010-10-22",
        "series_number": "10",
        "volume": "5",
        "issue": "10",
        "pages": "Art. No. e13564"
    },
    {
        "id": "authors:egmr2-mf591",
        "collection": "authors",
        "collection_id": "egmr2-mf591",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170306-122114222",
        "type": "article",
        "title": "Coverage statistics for sequence census methods",
        "author": [
            {
                "family_name": "Evans",
                "given_name": "Steven N.",
                "clpid": "Evans-S-N"
            },
            {
                "family_name": "Hower",
                "given_name": "Valerie",
                "clpid": "Hower-V"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "Background: We study the statistical properties of fragment coverage in genome sequencing experiments. In an extension of the classic Lander-Waterman model, we consider the effect of the length distribution of fragments. We also introduce a coding of the shape of the coverage depth function as a tree and explain how this can be used to detect regions with anomalous coverage. This modeling perspective is especially germane to current high-throughput sequencing experiments, where both sample preparation protocols and sequencing technology particulars can affect fragment length distributions. \n\nResults: Under the mild assumptions that fragment start sites are Poisson distributed and successive fragment lengths are independent and identically distributed, we observe that, regardless of fragment length distribution, the fragments produced in a sequencing experiment can be viewed as resulting from a two-dimensional spatial Poisson process. We then study the successive jumps of the coverage function, and show that they can be encoded as a random tree that is approximately a Galton-Watson tree with generation-dependent geometric offspring distributions whose parameters can be computed. \n\nConclusions: We extend standard analyses of shotgun sequencing that focus on coverage statistics at individual sites, and provide a null model for detecting deviations from random coverage in high-throughput sequence census based experiments. Our approach leads to explicit determinations of the null distributions of certain test statistics, while for others it greatly simplifies the approximation of their null distributions by simulation. Our focus on fragments also leads to a new approach to visualizing sequencing data that is of independent interest.",
        "doi": "10.1186/1471-2105-11-430",
        "pmcid": "PMC2940910",
        "issn": "1471-2105",
        "publisher": "BioMed Central",
        "publication": "BMC Bioinformatics",
        "publication_date": "2010-08-18",
        "volume": "11",
        "pages": "Art. No. 430"
    },
    {
        "id": "authors:vzhec-44b69",
        "collection": "authors",
        "collection_id": "vzhec-44b69",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170306-121310345",
        "type": "article",
        "title": "MetMap Enables Genome-Scale Methyltyping for Determining Methylation States in Populations",
        "author": [
            {
                "family_name": "Singer",
                "given_name": "Meromit",
                "clpid": "Singer-M"
            },
            {
                "family_name": "Boffelli",
                "given_name": "Dario",
                "clpid": "Boffelli-D"
            },
            {
                "family_name": "Dhahbi",
                "given_name": "Joseph",
                "clpid": "Dhahbi-Joseph"
            },
            {
                "family_name": "Sch\u00f6nhuth",
                "given_name": "Alexander",
                "clpid": "Sch\u00f6nhuth-A"
            },
            {
                "family_name": "Schroth",
                "given_name": "Gary P.",
                "clpid": "Schroth-G-P"
            },
            {
                "family_name": "Martin",
                "given_name": "David I. K.",
                "clpid": "Martin-David-I-K"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "The ability to assay genome-scale methylation patterns using high-throughput sequencing makes it possible to carry out association studies to determine the relationship between epigenetic variation and phenotype. While bisulfite sequencing can determine a methylome at high resolution, cost inhibits its use in comparative and population studies. MethylSeq, based on sequencing of fragment ends produced by a methylation-sensitive restriction enzyme, is a method for methyltyping (survey of methylation states) and is a site-specific and cost-effective alternative to whole-genome bisulfite sequencing. Despite its advantages, the use of MethylSeq has been restricted by biases in MethylSeq data that complicate the determination of methyltypes. Here we introduce a statistical method, MetMap, that produces corrected site-specific methylation states from MethylSeq experiments and annotates unmethylated islands across the genome. MetMap integrates genome sequence information with experimental data, in a statistically sound and cohesive Bayesian Network. It infers the extent of methylation at individual CGs and across regions, and serves as a framework for comparative methylation analysis within and among species. We validated MetMap's inferences with direct bisulfite sequencing, showing that the methylation status of sites and islands is accurately inferred. We used MetMap to analyze MethylSeq data from four human neutrophil samples, identifying novel, highly unmethylated islands that are invisible to sequence-based annotation strategies. The combination of MethylSeq and MetMap is a powerful and cost-effective tool for determining genome-scale methyltypes suitable for comparative and association studies.",
        "doi": "10.1371/journal.pcbi.1000888",
        "pmcid": "PMC2924245",
        "issn": "1553-7358",
        "publisher": "Public Library of Science",
        "publication": "PLOS Computational Biology",
        "publication_date": "2010-08",
        "series_number": "8",
        "volume": "6",
        "issue": "8",
        "pages": "Art. No. e1000888"
    },
    {
        "id": "authors:1sq2y-r7n90",
        "collection": "authors",
        "collection_id": "1sq2y-r7n90",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170306-123004736",
        "type": "article",
        "title": "Exploring the Genetic Basis of Variation in Gene Predictions with a Synthetic Association Study",
        "author": [
            {
                "family_name": "Levin",
                "given_name": "Tera C.",
                "clpid": "Levin-T-C"
            },
            {
                "family_name": "Glazer",
                "given_name": "Andrew M.",
                "clpid": "Glazer-A-M"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "family_name": "Brem",
                "given_name": "Rachel B.",
                "clpid": "Brem-R-B"
            },
            {
                "family_name": "Eisen",
                "given_name": "Michael B.",
                "orcid": "0000-0002-7528-738X",
                "clpid": "Eisen-M-B"
            }
        ],
        "abstract": "Identifying DNA polymorphisms that affect molecular processes like transcription, splicing, or translation typically requires genotyping and experimentally characterizing tissue from large numbers of individuals, which remains expensive and time consuming. Here we introduce an alternative strategy: a \"synthetic association study\" in which we computationally predict molecular phenotypes on artificial genomes containing randomly sampled combinations of polymorphic alleles, and perform a classical association study to identify genotypes underlying variation in these computationally predicted annotations. We applied this method to characterize the effects on gene structure of 32,792 single-nucleotide polymorphisms between two strains of the antibiotic producing fungus Penicilium chrysogenum. Although these SNPs represent only 0.1 percent of the nucleotides in the genome, they collectively altered 1.8 percent of predicted gene models between these strains. To determine which SNPs or combinations of SNPs were responsible for this variation, we predicted protein-coding genes in 500 intermediate genomes, each identical except for randomly chosen alleles at each SNP position. Of 30,468 gene models in the genome, 557 varied across these 500 genomes. 226 of these polymorphic gene models (40%) were perfectly correlated with individual SNPs, all of which were within or immediately proximal to the affected gene. The genetic architectures of the other 321 were more complex, with several examples of SNP epistasis that would have been difficult to predict a priori. We expect that many of the SNPs that affect computational gene structure reflect a biologically unrealistic sensitivity of the gene prediction algorithm to sequence changes, and we propose that genome annotation algorithms could be improved by minimizing their sensitivity to natural polymorphisms. 
However, many of the SNPs we identified are likely to affect transcript structure in vivo, and the synthetic association study approach can be easily generalized to any computed genome annotation to uncover relationships between genotype and important molecular phenotypes.",
        "doi": "10.1371/journal.pone.0011645",
        "pmcid": "PMC2912228",
        "issn": "1932-6203",
        "publisher": "Public Library of Science",
        "publication": "PLOS ONE",
        "publication_date": "2010-07-29",
        "series_number": "7",
        "volume": "5",
        "issue": "7",
        "pages": "Art. No. e11645"
    },
    {
        "id": "authors:q1bwm-dea54",
        "collection": "authors",
        "collection_id": "q1bwm-dea54",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170306-123614363",
        "type": "article",
        "title": "Exon-Level Microarray Analyses Identify Alternative Splicing Programs in Breast Cancer",
        "author": [
            {
                "family_name": "Lapuk",
                "given_name": "Anna",
                "clpid": "Lapuk-A"
            },
            {
                "family_name": "Marr",
                "given_name": "Henry",
                "clpid": "Marr-H"
            },
            {
                "family_name": "Jakkula",
                "given_name": "Lakshmi",
                "clpid": "Jakkula-L"
            },
            {
                "family_name": "Pedro",
                "given_name": "Helder",
                "clpid": "Pedro-H-A-M"
            },
            {
                "family_name": "Bhattacharya",
                "given_name": "Sanchita",
                "clpid": "Bhattacharya-S"
            },
            {
                "family_name": "Purdom",
                "given_name": "Elizabeth",
                "clpid": "Purdom-E"
            },
            {
                "family_name": "Hu",
                "given_name": "Zhi",
                "clpid": "Hu-Zhi"
            },
            {
                "family_name": "Simpson",
                "given_name": "Ken",
                "clpid": "Simpson-K"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "family_name": "Durinck",
                "given_name": "Steffen",
                "clpid": "Durinck-S"
            },
            {
                "family_name": "Wang",
                "given_name": "Nicholas",
                "clpid": "Wang-Nicholas"
            },
            {
                "family_name": "Parvin",
                "given_name": "Bahram",
                "clpid": "Parvin-B"
            },
            {
                "family_name": "Fontenay",
                "given_name": "Gerald",
                "clpid": "Fontenay-G"
            },
            {
                "family_name": "Speed",
                "given_name": "Terence",
                "clpid": "Speed-T"
            },
            {
                "family_name": "Garbe",
                "given_name": "James",
                "clpid": "Garbe-J-C"
            },
            {
                "family_name": "Stampfer",
                "given_name": "Martha",
                "clpid": "Stampfer-Martha"
            },
            {
                "family_name": "Bayandorian",
                "given_name": "Hovig",
                "clpid": "Bayandorian-H"
            },
            {
                "family_name": "Dorton",
                "given_name": "Shannon",
                "clpid": "Dorton-S"
            },
            {
                "family_name": "Clark",
                "given_name": "Tyson A.",
                "clpid": "Clark-T-A"
            },
            {
                "family_name": "Schweitzer",
                "given_name": "Anthony",
                "clpid": "Schweitzer-A"
            },
            {
                "family_name": "Wyrobek",
                "given_name": "Andrew",
                "clpid": "Wyrobek-A"
            },
            {
                "family_name": "Feiler",
                "given_name": "Heidi",
                "clpid": "Feller-H"
            },
            {
                "family_name": "Spellman",
                "given_name": "Paul",
                "clpid": "Spellman-P"
            },
            {
                "family_name": "Conboy",
                "given_name": "John",
                "clpid": "Conboy-J-G"
            },
            {
                "family_name": "Gray",
                "given_name": "Joe W.",
                "clpid": "Gray-J-W"
            }
        ],
        "abstract": "Protein isoforms produced by alternative splicing (AS) of many genes have been implicated in several aspects of cancer genesis and progression. These observations motivated a genome-wide assessment of AS in breast cancer. We accomplished this by measuring exon level expression in 31 breast cancer and nonmalignant immortalized cell lines representing luminal, basal, and claudin-low breast cancer subtypes using Affymetrix Human Junction Arrays. We analyzed these data using a computational pipeline specifically designed to detect AS with a low false-positive rate. This identified 181 splice events representing 156 genes as candidates for AS. Reverse transcription-PCR validation of a subset of predicted AS events confirmed 90%. Approximately half of the AS events were associated with basal, luminal, or claudin-low breast cancer subtypes. Exons involved in claudin-low subtype\u2013specific AS were significantly associated with the presence of evolutionarily conserved binding motifs for the tissue-specific Fox2 splicing factor. Small interfering RNA knockdown of Fox2 confirmed the involvement of this splicing factor in subtype-specific AS. The subtype-specific AS detected in this study likely reflects the splicing pattern in the breast cancer progenitor cells in which the tumor arose and suggests the utility of assays for Fox-mediated AS in cancer subtype definition and early detection. These data also suggest the possibility of reducing the toxicity of protein-targeted breast cancer treatments by targeting protein isoforms that are not present in limiting normal tissues.",
        "doi": "10.1158/1541-7786.MCR-09-0528",
        "pmcid": "PMC2911965",
        "issn": "1541-7786",
        "publisher": "American Association for Cancer Research",
        "publication": "Molecular Cancer Research",
        "publication_date": "2010-07",
        "series_number": "7",
        "volume": "8",
        "issue": "7",
        "pages": "961-974"
    },
    {
        "id": "authors:wvkv3-hv456",
        "collection": "authors",
        "collection_id": "wvkv3-hv456",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20100601-111602154",
        "type": "article",
        "title": "Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation",
        "author": [
            {
                "family_name": "Trapnell",
                "given_name": "Cole",
                "clpid": "Trapnell-C"
            },
            {
                "family_name": "Williams",
                "given_name": "Brian A.",
                "clpid": "Williams-B-A"
            },
            {
                "family_name": "Pertea",
                "given_name": "Geo",
                "clpid": "Pertea-G"
            },
            {
                "family_name": "Mortazavi",
                "given_name": "Ali",
                "orcid": "0000-0002-4259-6362",
                "clpid": "Mortazavi-A"
            },
            {
                "family_name": "Kwan",
                "given_name": "Gordon",
                "clpid": "Kwan-Gordon"
            },
            {
                "family_name": "van Baren",
                "given_name": "Marijke J.",
                "clpid": "van-Baren-M-J"
            },
            {
                "family_name": "Salzberg",
                "given_name": "Steven L.",
                "clpid": "Salzberg-S-L"
            },
            {
                "family_name": "Wold",
                "given_name": "Barbara J.",
                "orcid": "0000-0003-3235-8130",
                "clpid": "Wold-B-J"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "High-throughput mRNA sequencing (RNA-Seq) promises simultaneous transcript discovery and abundance estimation. However, this would require algorithms that are not restricted by prior gene annotations and that account for alternative transcription and splicing. Here we introduce such algorithms in an open-source software program called Cufflinks. To test Cufflinks, we sequenced and analyzed &gt;430 million paired 75-bp RNA-Seq reads from a mouse myoblast cell line over a differentiation time series. We detected 13,692 known transcripts and 3,724 previously unannotated ones, 62% of which are supported by independent expression data or by homologous genes in other species. Over the time series, 330 genes showed complete switches in the dominant transcription start site (TSS) or splice isoform, and we observed more subtle shifts in 1,304 other genes. These results suggest that Cufflinks can illuminate the substantial regulatory flexibility and complexity in even this well-studied model of muscle development and that it can improve transcriptome-based genome annotation.",
        "doi": "10.1038/nbt.1621",
        "pmcid": "PMC3146043",
        "issn": "1087-0156",
        "publisher": "Nature Publishing Group",
        "publication": "Nature Biotechnology",
        "publication_date": "2010-05",
        "series_number": "5",
        "volume": "28",
        "issue": "5",
        "pages": "511-515"
    },
    {
        "id": "authors:y0yeq-t0p17",
        "collection": "authors",
        "collection_id": "y0yeq-t0p17",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170306-125457377",
        "type": "article",
        "title": "Binding Site Turnover Produces Pervasive Quantitative Changes in Transcription Factor Binding between Closely Related Drosophila Species",
        "author": [
            {
                "family_name": "Bradley",
                "given_name": "Robert K.",
                "clpid": "Bradley-R-K"
            },
            {
                "family_name": "Li",
                "given_name": "Xiao-Yong",
                "clpid": "Li-Xiao-Yong"
            },
            {
                "family_name": "Trapnell",
                "given_name": "Cole",
                "clpid": "Trapnell-C"
            },
            {
                "family_name": "Davidson",
                "given_name": "Stuart",
                "clpid": "Davidson-S"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "family_name": "Chu",
                "given_name": "Hou Cheng",
                "clpid": "Chu-Hou-Cheng"
            },
            {
                "family_name": "Tonkin",
                "given_name": "Leath A.",
                "clpid": "Tonkin-L-A"
            },
            {
                "family_name": "Biggin",
                "given_name": "Mark D.",
                "clpid": "Biggin-M-D"
            },
            {
                "family_name": "Eisen",
                "given_name": "Michael B.",
                "orcid": "0000-0002-7528-738X",
                "clpid": "Eisen-M-B"
            }
        ],
        "abstract": "Changes in gene expression play an important role in evolution, yet the molecular mechanisms underlying regulatory evolution are poorly understood. Here we compare genome-wide binding of the six transcription factors that initiate segmentation along the anterior-posterior axis in embryos of two closely related species: Drosophila melanogaster and Drosophila yakuba. Where we observe binding by a factor in one species, we almost always observe binding by that factor to the orthologous sequence in the other species. Levels of binding, however, vary considerably. The magnitude and direction of the interspecies differences in binding levels of all six factors are strongly correlated, suggesting a role for chromatin or other factor-independent forces in mediating the divergence of transcription factor binding. Nonetheless, factor-specific quantitative variation in binding is common, and we show that it is driven to a large extent by the gain and loss of cognate recognition sequences for the given factor. We find only a weak correlation between binding variation and regulatory function. These data provide the first genome-wide picture of how modest levels of sequence divergence between highly morphologically similar species affect a system of coordinately acting transcription factors during animal development, and highlight the dominant role of quantitative variation in transcription factor binding over short evolutionary distances.",
        "doi": "10.1371/journal.pbio.1000343",
        "pmcid": "PMC2843597",
        "issn": "1545-7885",
        "publisher": "Public Library of Science",
        "publication": "PLoS Biology",
        "publication_date": "2010-03",
        "series_number": "3",
        "volume": "8",
        "issue": "3",
        "pages": "Art. No. e1000343"
    },
    {
        "id": "authors:2jf89-zme75",
        "collection": "authors",
        "collection_id": "2jf89-zme75",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170306-131319458",
        "type": "article",
        "title": "Disordered Microbial Communities in Asthmatic Airways",
        "author": [
            {
                "family_name": "Hilty",
                "given_name": "Markus",
                "clpid": "Hilty-Markus"
            },
            {
                "family_name": "Burke",
                "given_name": "Conor",
                "clpid": "Burke-Conor"
            },
            {
                "family_name": "Pedro",
                "given_name": "Helder",
                "clpid": "Pedro-Helder-A-M"
            },
            {
                "family_name": "Cardenas",
                "given_name": "Paul",
                "clpid": "Cardenas-Paul"
            },
            {
                "family_name": "Bush",
                "given_name": "Andy",
                "clpid": "Bush-Andy"
            },
            {
                "family_name": "Bossley",
                "given_name": "Cara",
                "clpid": "Bossley-Cara"
            },
            {
                "family_name": "Davies",
                "given_name": "Jane",
                "orcid": "0000-0002-4108-4357",
                "clpid": "Davies-Jane"
            },
            {
                "family_name": "Ervine",
                "given_name": "Aaron",
                "clpid": "Ervine-Aaron"
            },
            {
                "family_name": "Poulter",
                "given_name": "Len",
                "clpid": "Poulter-Len"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "family_name": "Moffatt",
                "given_name": "Miriam F.",
                "clpid": "Moffatt-Miriam-F"
            },
            {
                "family_name": "Cookson",
                "given_name": "William O. C.",
                "clpid": "Cookson-William-O-C"
            }
        ],
        "abstract": "Background: A rich microbial environment in infancy protects against asthma [1], [2] and infections precipitate asthma exacerbations [3]. We compared the airway microbiota at three levels in adult patients with asthma, the related condition of COPD, and controls. We also studied bronchial lavage from asthmatic children and controls. \n\nPrincipal Findings: We identified 5,054 16S rRNA bacterial sequences from 43 subjects, detecting &gt;70% of species present. The bronchial tree was not sterile, and contained a mean of 2,000 bacterial genomes per cm2 surface sampled. Pathogenic Proteobacteria, particularly Haemophilus spp., were much more frequent in bronchi of adult asthmatics or patients with COPD than controls. We found similar highly significant increases in Proteobacteria in asthmatic children. Conversely, Bacteroidetes, particularly Prevotella spp., were more frequent in controls than adult or child asthmatics or COPD patients. \n\nSignificance: The results show the bronchial tree to contain a characteristic microbiota, and suggest that this microbiota is disturbed in asthmatic airways.",
        "doi": "10.1371/journal.pone.0008578",
        "pmcid": "PMC2798952",
        "issn": "1932-6203",
        "publisher": "Public Library of Science",
        "publication": "PLOS ONE",
        "publication_date": "2010-01-05",
        "series_number": "1",
        "volume": "5",
        "issue": "1",
        "pages": "Art. No. e8578"
    },
    {
        "id": "authors:fbdkt-n0b22",
        "collection": "authors",
        "collection_id": "fbdkt-n0b22",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170306-133921221",
        "type": "article",
        "title": "Convex Rank Tests and Semigraphoids",
        "author": [
            {
                "family_name": "Morton",
                "given_name": "Jason",
                "clpid": "Morton-J"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "family_name": "Shiu",
                "given_name": "Anne",
                "clpid": "Shiu-Anne"
            },
            {
                "family_name": "Sturmfels",
                "given_name": "Bernd",
                "clpid": "Sturmfels-B"
            },
            {
                "family_name": "Wienand",
                "given_name": "Oliver",
                "clpid": "Wienand-O"
            }
        ],
        "abstract": "Convex rank tests are partitions of the symmetric group which have desirable geometric properties. The statistical tests defined by such partitions involve counting all permutations in the equivalence classes. Each class consists of the linear extensions of a partially ordered set specified by data. Our methods refine existing rank tests of nonparametric statistics, such as the sign test and the runs test, and are useful for exploratory analysis of ordinal data. We establish a bijection between convex rank tests and probabilistic conditional independence structures known as semigraphoids. The subclass of submodular rank tests is derived from faces of the cone of submodular functions or from Minkowski summands of the permutohedron. We enumerate all small instances of such rank tests. Of particular interest are graphical tests, which correspond to both graphical models and to graph associahedra.",
        "doi": "10.1137/080715822",
        "issn": "0895-4801",
        "publisher": "Society for Industrial and Applied Mathematics",
        "publication": "SIAM Journal on Discrete Mathematics",
        "publication_date": "2009-07-10",
        "series_number": "3",
        "volume": "23",
        "issue": "3",
        "pages": "1117-1134"
    },
    {
        "id": "authors:6d8hx-6w537",
        "collection": "authors",
        "collection_id": "6d8hx-6w537",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170306-141357019",
        "type": "article",
        "title": "TopHat: discovering splice junctions with RNA-Seq",
        "author": [
            {
                "family_name": "Trapnell",
                "given_name": "Cole",
                "clpid": "Trapnell-C"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "family_name": "Salzberg",
                "given_name": "Steven L.",
                "clpid": "Salzberg-S-L"
            }
        ],
        "abstract": "Motivation: A new protocol for sequencing the messenger RNA in a cell, known as RNA-Seq, generates millions of short sequence fragments in a single run. These fragments, or 'reads', can be used to measure levels of gene expression and to identify novel splice variants of genes. However, current software for aligning RNA-Seq data to a genome relies on known splice junctions and cannot identify novel ones. TopHat is an efficient read-mapping algorithm designed to align reads from an RNA-Seq experiment to a reference genome without relying on known splice sites. \n\nResults: We mapped the RNA-Seq reads from a recent mammalian RNA-Seq experiment and recovered more than 72% of the splice junctions reported by the annotation-based software from that study, along with nearly 20 000 previously unreported junctions. The TopHat pipeline is much faster than previous systems, mapping nearly 2.2 million reads per CPU hour, which is sufficient to process an entire RNA-Seq experiment in less than a day on a standard desktop computer. We describe several challenges unique to ab initio splice site discovery from RNA-Seq reads that will require further algorithm development. \n\nAvailability: TopHat is free, open-source software available from http://tophat.cbcb.umd.edu",
        "doi": "10.1093/bioinformatics/btp120",
        "pmcid": "PMC2672628",
        "issn": "1367-4803",
        "publisher": "Oxford University Press",
        "publication": "Bioinformatics",
        "publication_date": "2009-05-01",
        "series_number": "9",
        "volume": "25",
        "issue": "9",
        "pages": "1105-1111"
    },
    {
        "id": "authors:8c606-67x34",
        "collection": "authors",
        "collection_id": "8c606-67x34",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170307-094400293",
        "type": "article",
        "title": "Why neighbor-joining works",
        "author": [
            {
                "family_name": "Mihaescu",
                "given_name": "Radu",
                "clpid": "Mihaescu-R"
            },
            {
                "family_name": "Levy",
                "given_name": "Dan",
                "clpid": "Levy-D"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "We show that the neighbor-joining algorithm is a robust quartet method for constructing trees from distances. This leads to a new performance guarantee that contains Atteson's optimal radius bound as a special case and explains many cases where neighbor-joining is successful even when Atteson's criterion is not satisfied. We also provide a proof for Atteson's conjecture on the optimal edge radius of the neighbor-joining algorithm. The strong performance guarantees we provide also hold for the quadratic time fast neighbor-joining algorithm, thus providing a theoretical basis for inferring very large phylogenies with neighbor-joining.",
        "doi": "10.1007/s00453-007-9116-4",
        "issn": "0178-4617",
        "publisher": "Springer",
        "publication": "Algorithmica",
        "publication_date": "2009-05",
        "series_number": "1",
        "volume": "54",
        "issue": "1",
        "pages": "1-24"
    },
    {
        "id": "authors:rbmj2-r9288",
        "collection": "authors",
        "collection_id": "rbmj2-r9288",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170306-135830452",
        "type": "article",
        "title": "Fast Statistical Alignment",
        "author": [
            {
                "family_name": "Bradley",
                "given_name": "Robert K.",
                "clpid": "Bradley-R-K"
            },
            {
                "family_name": "Roberts",
                "given_name": "Adam",
                "clpid": "Roberts-A"
            },
            {
                "family_name": "Smoot",
                "given_name": "Michael",
                "clpid": "Smoot-M"
            },
            {
                "family_name": "Juvekar",
                "given_name": "Sudeep",
                "clpid": "Juvekar-S"
            },
            {
                "family_name": "Do",
                "given_name": "Jaeyoung",
                "clpid": "Do-Jaeyoung"
            },
            {
                "family_name": "Dewey",
                "given_name": "Colin",
                "clpid": "Dewey-C-N"
            },
            {
                "family_name": "Holmes",
                "given_name": "Ian",
                "clpid": "Holmes-I"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "We describe a new program for the alignment of multiple biological sequences that is both statistically motivated and fast enough for problem sizes that arise in practice. Our Fast Statistical Alignment program is based on pair hidden Markov models which approximate an insertion/deletion process on a tree and uses a sequence annealing algorithm to combine the posterior probabilities estimated from these models into a multiple alignment. FSA uses its explicit statistical model to produce multiple alignments which are accompanied by estimates of the alignment accuracy and uncertainty for every column and character of the alignment\u2014previously available only with alignment programs which use computationally-expensive Markov Chain Monte Carlo approaches\u2014yet can align thousands of long sequences. Moreover, FSA utilizes an unsupervised query-specific learning procedure for parameter estimation which leads to improved accuracy on benchmark reference alignments in comparison to existing programs. The centroid alignment approach taken by FSA, in combination with its learning procedure, drastically reduces the amount of false-positive alignment on biological data in comparison to that given by other methods. The FSA program and a companion visualization tool for exploring uncertainty in alignments can be used via a web interface at http://orangutan.math.berkeley.edu/fsa/, and the source code is available at http://fsa.sourceforge.net/.",
        "doi": "10.1371/journal.pcbi.1000392",
        "pmcid": "PMC2684580",
        "issn": "1553-7358",
        "publisher": "Public Library of Science",
        "publication": "PLOS Computational Biology",
        "publication_date": "2009-05",
        "series_number": "5",
        "volume": "5",
        "issue": "5",
        "pages": "Art. No. e1000392"
    },
    {
        "id": "authors:64b05-1m385",
        "collection": "authors",
        "collection_id": "64b05-1m385",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170307-103825678",
        "type": "article",
        "title": "Specific alignment of structured RNA: stochastic grammars and sequence annealing",
        "author": [
            {
                "family_name": "Bradley",
                "given_name": "Robert K.",
                "clpid": "Bradley-R-K"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "family_name": "Holmes",
                "given_name": "Ian",
                "clpid": "Holmes-I"
            }
        ],
        "abstract": "Motivation: Whole-genome screens suggest that eukaryotic genomes are dense with non-coding RNAs (ncRNAs). We introduce a novel approach to RNA multiple alignment which couples a generative probabilistic model of sequence and structure with an efficient sequence annealing approach for exploring the space of multiple alignments. This leads to a new software program, Stemloc-AMA, that is both accurate and specific in the alignment of multiple related RNA sequences. \n\nResults: When tested on the benchmark datasets BRalibase II and BRalibase 2.1, Stemloc-AMA has comparable sensitivity to and better specificity than the best competing methods. We use a large-scale random sequence experiment to show that while most alignment programs maximize sensitivity at the expense of specificity, even to the point of giving complete alignments of non-homologous sequences, Stemloc-AMA aligns only sequences with detectable homology and leaves unrelated sequences largely unaligned. Such accurate and specific alignments are crucial for comparative-genomics analysis, from inferring phylogeny to estimating substitution rates across different lineages. \n\nAvailability: Stemloc-AMA is available from http://biowiki.org/StemLocAMA as part of the dart software package for sequence analysis.",
        "doi": "10.1093/bioinformatics/btn495",
        "pmcid": "PMC2732270",
        "issn": "1367-4803",
        "publisher": "Oxford University Press",
        "publication": "Bioinformatics",
        "publication_date": "2008-12-01",
        "series_number": "23",
        "volume": "24",
        "issue": "23",
        "pages": "2677-2683"
    },
    {
        "id": "authors:newbg-mqk69",
        "collection": "authors",
        "collection_id": "newbg-mqk69",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170306-144249240",
        "type": "article",
        "title": "Combinatorics of least squares trees",
        "author": [
            {
                "family_name": "Mihaescu",
                "given_name": "Radu",
                "clpid": "Mihaescu-R"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "A recurring theme in the least squares approach to phylogenetics has been the discovery of elegant combinatorial formulas for the least squares estimates of edge lengths. These formulas have proved useful for the development of efficient algorithms, and have also been important for understanding connections among popular phylogeny algorithms. For example, the selection criterion of the neighbor-joining algorithm is now understood in terms of the combinatorial formulas of Pauplin for estimating tree length. We highlight a phylogenetically desirable property that weighted least squares methods should satisfy, and provide a complete characterization of methods that satisfy the property. The necessary and sufficient condition is a multiplicative four point condition that the variance matrix needs to satisfy. The proof is based on the observation that the Lagrange multipliers in the proof of the Gauss\u2013Markov theorem are tree-additive. Our results generalize and complete previous work on ordinary least squares, balanced minimum evolution and the taxon weighted variance model. They also provide a time optimal algorithm for computation.",
        "doi": "10.1073/pnas.0802089105",
        "pmcid": "PMC2533170",
        "issn": "0027-8424",
        "publisher": "National Academy of Sciences",
        "publication": "Proceedings of the National Academy of Sciences of the United States of America",
        "publication_date": "2008-09-09",
        "series_number": "36",
        "volume": "105",
        "issue": "36",
        "pages": "13206-13211"
    },
    {
        "id": "authors:xmecw-c0d24",
        "collection": "authors",
        "collection_id": "xmecw-c0d24",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170307-104539290",
        "type": "article",
        "title": "Comparison of Pattern Detection Methods in Microarray Time Series of the Segmentation Clock",
        "author": [
            {
                "family_name": "Dequ\u00e9ant",
                "given_name": "Mary-Lee",
                "clpid": "Dequ\u00e9ant-M-L"
            },
            {
                "family_name": "Ahnert",
                "given_name": "Sebastian",
                "clpid": "Ahnert-S"
            },
            {
                "family_name": "Edelsbrunner",
                "given_name": "Herbert",
                "clpid": "Edelsbrunner-H"
            },
            {
                "family_name": "Fink",
                "given_name": "Thomas M. A.",
                "clpid": "Fink-T-M-A"
            },
            {
                "family_name": "Glynn",
                "given_name": "Earl F.",
                "clpid": "Glynn-E-F"
            },
            {
                "family_name": "Hattem",
                "given_name": "Gaye",
                "clpid": "Hattem-G"
            },
            {
                "family_name": "Kudlicki",
                "given_name": "Andrzej",
                "clpid": "Kudlicki-A"
            },
            {
                "family_name": "Mileyko",
                "given_name": "Yuriy",
                "clpid": "Mileyko-Y"
            },
            {
                "family_name": "Morton",
                "given_name": "Jason",
                "clpid": "Morton-J"
            },
            {
                "family_name": "Mushegian",
                "given_name": "Arcady R.",
                "clpid": "Mushegian-A-R"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "family_name": "Rowicka",
                "given_name": "Maga",
                "clpid": "Rowicka-M"
            },
            {
                "family_name": "Shiu",
                "given_name": "Anne",
                "clpid": "Shiu-Anne"
            },
            {
                "family_name": "Sturmfels",
                "given_name": "Bernd",
                "clpid": "Sturmfels-B"
            },
            {
                "family_name": "Pourqui\u00e9",
                "given_name": "Olivier",
                "clpid": "Pourqui\u00e9-O"
            }
        ],
        "abstract": "While genome-wide gene expression data are generated at an increasing rate, the repertoire of approaches for pattern discovery in these data is still limited. Identifying subtle patterns of interest in large amounts of data (tens of thousands of profiles) associated with a certain level of noise remains a challenge. A microarray time series was recently generated to study the transcriptional program of the mouse segmentation clock, a biological oscillator associated with the periodic formation of the segments of the body axis. A method related to Fourier analysis, the Lomb-Scargle periodogram, was used to detect periodic profiles in the dataset, leading to the identification of a novel set of cyclic genes associated with the segmentation clock. Here, we applied to the same microarray time series dataset four distinct mathematical methods to identify significant patterns in gene expression profiles. These methods are called: Phase consistency, Address reduction, Cyclohedron test and Stable persistence, and are based on different conceptual frameworks that are either hypothesis- or data-driven. Some of the methods, unlike Fourier transforms, are not dependent on the assumption of periodicity of the pattern of interest. Remarkably, these methods identified blindly the expression profiles of known cyclic genes as the most significant patterns in the dataset. Many candidate genes predicted by more than one approach appeared to be true positive cyclic genes and will be of particular interest for future research. In addition, these methods predicted novel candidate cyclic genes that were consistent with previous biological knowledge and experimental validation in mouse embryos.\nOur results demonstrate the utility of these novel pattern detection strategies, notably for detection of periodic profiles, and suggest that combining several distinct mathematical approaches to analyze microarray datasets is a valuable strategy for identifying genes that exhibit novel, interesting transcriptional patterns.",
        "doi": "10.1371/journal.pone.0002856",
        "pmcid": "PMC2481401",
        "issn": "1932-6203",
        "publisher": "Public Library of Science",
        "publication": "PLOS ONE",
        "publication_date": "2008-08-06",
        "series_number": "8",
        "volume": "3",
        "issue": "8",
        "pages": "Art. No. e2856"
    },
    {
        "id": "authors:pc0jh-45q03",
        "collection": "authors",
        "collection_id": "pc0jh-45q03",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170307-111036850",
        "type": "article",
        "title": "Combining statistical alignment and phylogenetic footprinting to detect regulatory elements",
        "author": [
            {
                "family_name": "Satija",
                "given_name": "Rahul",
                "clpid": "Satija-R"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "family_name": "Hein",
                "given_name": "Jotun",
                "clpid": "Hein-J"
            }
        ],
        "abstract": "Motivation: Traditional alignment-based phylogenetic footprinting approaches make predictions on the basis of a single assumed alignment. The predictions are therefore highly sensitive to alignment errors or regions of alignment uncertainty. Alternatively, statistical alignment methods provide a framework for performing phylogenetic analyses by examining a distribution of alignments. \n\nResults: We developed a novel algorithm for predicting functional elements by combining statistical alignment and phylogenetic footprinting (SAPF). SAPF simultaneously performs both alignment and annotation by combining phylogenetic footprinting techniques with a hidden Markov model (HMM) transducer-based multiple alignment model, and can analyze sequence data from multiple sequences. We assessed SAPF's predictive performance on two simulated datasets and three well-annotated cis-regulatory modules from newly sequenced Drosophila genomes. The results demonstrate that removing the traditional dependence on a single alignment can significantly augment the predictive performance, especially when there is uncertainty in the alignment of functional regions. \n\nAvailability: SAPF is freely available to download online at http://www.stats.ox.ac.uk/~satija/SAPF/",
        "doi": "10.1093/bioinformatics/btn104",
        "issn": "1367-4803",
        "publisher": "Oxford University Press",
        "publication": "Bioinformatics",
        "publication_date": "2008-05-15",
        "series_number": "10",
        "volume": "24",
        "issue": "10",
        "pages": "1236-1242"
    },
    {
        "id": "authors:096a7-06d87",
        "collection": "authors",
        "collection_id": "096a7-06d87",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170306-133352205",
        "type": "article",
        "title": "Viral Population Estimation Using Pyrosequencing",
        "author": [
            {
                "family_name": "Tesler",
                "given_name": "Glenn",
                "clpid": "Tesler-G"
            },
            {
                "family_name": "Eriksson",
                "given_name": "Nicholas",
                "clpid": "Eriksson-N"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "family_name": "Mitsuya",
                "given_name": "Yumi",
                "clpid": "Mitsuya-Yumi"
            },
            {
                "family_name": "Rhee",
                "given_name": "Soo-Yon",
                "clpid": "Rhee-Soo-Yon"
            },
            {
                "family_name": "Wang",
                "given_name": "Chunlin",
                "clpid": "Wang-Chunlin"
            },
            {
                "family_name": "Gharizadeh",
                "given_name": "Baback",
                "clpid": "Gharizadeh-B"
            },
            {
                "family_name": "Ronaghi",
                "given_name": "Mostafa",
                "clpid": "Ronaghi-M"
            },
            {
                "family_name": "Shafer",
                "given_name": "Robert W.",
                "clpid": "Shafer-R-W"
            },
            {
                "family_name": "Beerenwinkel",
                "given_name": "Niko",
                "clpid": "Beerenwinkel-N"
            }
        ],
        "abstract": "The diversity of virus populations within single infected hosts presents a major difficulty for the natural immune response as well as for vaccine design and antiviral drug therapy. Recently developed pyrophosphate-based sequencing technologies (pyrosequencing) can be used for quantifying this diversity by ultra-deep sequencing of virus samples. We present computational methods for the analysis of such sequence data and apply these techniques to pyrosequencing data obtained from HIV populations within patients harboring drug-resistant virus strains. Our main result is the estimation of the population structure of the sample from the pyrosequencing reads. This inference is based on a statistical approach to error correction, followed by a combinatorial algorithm for constructing a minimal set of haplotypes that explain the data. Using this set of explaining haplotypes, we apply a statistical model to infer the frequencies of the haplotypes in the population via an expectation\u2013maximization (EM) algorithm. We demonstrate that pyrosequencing reads allow for effective population reconstruction by extensive simulations and by comparison to 165 sequences obtained directly from clonal sequencing of four independent, diverse HIV populations. Thus, pyrosequencing can be used for cost-effective estimation of the structure of virus populations, promising new insights into viral evolutionary dynamics and disease control strategies.",
        "doi": "10.1371/journal.pcbi.1000074",
        "pmcid": "PMC2323617",
        "issn": "1553-7358",
        "publisher": "Public Library of Science",
        "publication": "PLoS Computational Biology",
        "publication_date": "2008-05",
        "series_number": "5",
        "volume": "4",
        "issue": "5",
        "pages": "Art. No. e1000074"
    },
    {
        "id": "authors:mdrpa-46960",
        "collection": "authors",
        "collection_id": "mdrpa-46960",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170306-143033087",
        "type": "article",
        "title": "On the optimality of the neighbor-joining algorithm",
        "author": [
            {
                "family_name": "Eickmeyer",
                "given_name": "Kord",
                "clpid": "Eickmeyer-K"
            },
            {
                "family_name": "Huggins",
                "given_name": "Peter",
                "clpid": "Huggins-P"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "family_name": "Yoshida",
                "given_name": "Ruriko",
                "clpid": "Yoshida-Ruriko"
            }
        ],
        "abstract": "The popular neighbor-joining (NJ) algorithm used in phylogenetics is a greedy algorithm for finding the balanced minimum evolution (BME) tree associated to a dissimilarity map. From this point of view, NJ is \"optimal\" when the algorithm outputs the tree which minimizes the balanced minimum evolution criterion. We use the fact that the NJ tree topology and the BME tree topology are determined by polyhedral subdivisions of the spaces of dissimilarity maps $\\mathbb{R}_{+}^{\\binom{n}{2}}$ to study the optimality of the neighbor-joining algorithm. In particular, we investigate and compare the polyhedral subdivisions for $n \\leq 8$. This requires the measurement of volumes of spherical polytopes in high dimension, which we obtain using a combination of Monte Carlo methods and polyhedral algorithms. Our results include a demonstration that highly unrelated trees can be co-optimal in BME reconstruction, and that NJ regions are not convex. We obtain the $l_2$ radius for neighbor-joining for $n = 5$ and we conjecture that the ability of the neighbor-joining algorithm to recover the BME tree depends on the diameter of the BME tree.",
        "doi": "10.1186/1748-7188-3-5",
        "pmcid": "PMC2430562",
        "issn": "1748-7188",
        "publisher": "BioMed Central",
        "publication": "Algorithms for Molecular Biology",
        "publication_date": "2008-04-30",
        "volume": "3",
        "pages": "Art. No. 5"
    },
    {
        "id": "authors:crp2q-p2j11",
        "collection": "authors",
        "collection_id": "crp2q-p2j11",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170307-112926954",
        "type": "article",
        "title": "Discovery of functional elements in 12 Drosophila genomes using evolutionary signatures",
        "author": [
            {
                "family_name": "Stark",
                "given_name": "Alexander",
                "clpid": "Stark-A"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "Sequencing of multiple related species followed by comparative genomics analysis constitutes a powerful approach for the systematic understanding of any genome. Here, we use the genomes of 12 Drosophila species for the de novo discovery of functional elements in the fly. Each type of functional element shows characteristic patterns of change, or 'evolutionary signatures', dictated by its precise selective constraints. Such signatures enable recognition of new protein-coding genes and exons, spurious and incorrect gene annotations, and numerous unusual gene structures, including abundant stop-codon readthrough. Similarly, we predict non-protein-coding RNA genes and structures, and new microRNA (miRNA) genes. We provide evidence of miRNA processing and functionality from both hairpin arms and both DNA strands. We identify several classes of pre- and post-transcriptional regulatory motifs, and predict individual motif instances with high confidence. We also study how discovery power scales with the divergence and number of species compared, and we provide general guidelines for comparative studies.",
        "doi": "10.1038/nature06340",
        "pmcid": "PMC2474711",
        "issn": "0028-0836",
        "publisher": "Nature Publishing Group",
        "publication": "Nature",
        "publication_date": "2007-11-08",
        "series_number": "7167",
        "volume": "450",
        "issue": "7167",
        "pages": "219-232"
    },
    {
        "id": "authors:04wc6-q2j80",
        "collection": "authors",
        "collection_id": "04wc6-q2j80",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170307-113900780",
        "type": "article",
        "title": "Evolution of genes and genomes on the Drosophila phylogeny",
        "author": [
            {
                "family_name": "Clark",
                "given_name": "Andrew G.",
                "clpid": "Clark-A-G"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "literal": "Drosophila 12 Genomes Consortium"
            }
        ],
        "abstract": "Comparative analysis of multiple genomes in a phylogenetic framework dramatically improves the precision and sensitivity of evolutionary inference, producing more robust results than single-genome analyses can provide. The genomes of 12 Drosophila species, ten of which are presented here for the first time (sechellia, simulans, yakuba, erecta, ananassae, persimilis, willistoni, mojavensis, virilis and grimshawi), illustrate how rates and patterns of sequence divergence across taxa can illuminate evolutionary processes on a genomic scale. These genome sequences augment the formidable genetic tools that have made Drosophila melanogaster a pre-eminent model for animal genetics, and will further catalyse fundamental research on mechanisms of development, cell biology, genetics, disease, neurobiology, behaviour, physiology and evolution. Despite remarkable similarities among these Drosophila species, we identified many putatively non-neutral changes in protein-coding genes, non-coding RNA genes, and cis-regulatory regions. These may prove to underlie differences in the ecology and behaviour of these diverse species.",
        "doi": "10.1038/nature06341",
        "issn": "0028-0836",
        "publisher": "Nature Publishing Group",
        "publication": "Nature",
        "publication_date": "2007-11-08",
        "series_number": "7167",
        "volume": "450",
        "issue": "7167",
        "pages": "203-218"
    },
    {
        "id": "authors:yenc3-ndk14",
        "collection": "authors",
        "collection_id": "yenc3-ndk14",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170307-095818473",
        "type": "article",
        "title": "Towards the Human Genotope",
        "author": [
            {
                "family_name": "Huggins",
                "given_name": "Peter",
                "clpid": "Huggins-P"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "family_name": "Sturmfels",
                "given_name": "Bernd",
                "clpid": "Sturmfels-B"
            }
        ],
        "abstract": "The human genotope is the convex hull of all allele frequency vectors that can be obtained from the genotypes present in the human population. In this paper, we take a few initial steps toward a description of this object, which may be fundamental for future population-based genetics studies. Here we use data from the HapMap Project, restricted to two ENCODE regions, to study a subpolytope of the human genotope. We study three different approaches for obtaining informative low-dimensional projections of this subpolytope. The projections are specified by projection onto a few tag SNPs, principal component analysis, and archetypal analysis. We describe the application of our geometric approach to identifying structure in populations based on single nucleotide polymorphisms.",
        "doi": "10.1007/s11538-007-9244-7",
        "issn": "0092-8240",
        "publisher": "Springer",
        "publication": "Bulletin of Mathematical Biology",
        "publication_date": "2007-11",
        "series_number": "8",
        "volume": "69",
        "issue": "8",
        "pages": "2723-2735"
    },
    {
        "id": "authors:qj3nn-jb436",
        "collection": "authors",
        "collection_id": "qj3nn-jb436",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170307-114846354",
        "type": "article",
        "title": "Population Genomics: Whole-Genome Analysis of Polymorphism and Divergence in Drosophila simulans",
        "author": [
            {
                "family_name": "Begun",
                "given_name": "David J.",
                "clpid": "Begun-David-J"
            },
            {
                "family_name": "Holloway",
                "given_name": "Alisha K.",
                "clpid": "Holloway-Alisha-K"
            },
            {
                "family_name": "Stevens",
                "given_name": "Kristian",
                "clpid": "Stevens-Kristian"
            },
            {
                "family_name": "Hillier",
                "given_name": "LaDeana W.",
                "clpid": "Hillier-LaDeana-W"
            },
            {
                "family_name": "Poh",
                "given_name": "Yu-Ping",
                "clpid": "Poh-Yu-Ping"
            },
            {
                "family_name": "Hahn",
                "given_name": "Matthew W.",
                "clpid": "Hahn-Matthew-W"
            },
            {
                "family_name": "Nista",
                "given_name": "Phillip M.",
                "clpid": "Nista-Phillip-M"
            },
            {
                "family_name": "Jones",
                "given_name": "Corbin D.",
                "clpid": "Jones-Corbin-D"
            },
            {
                "family_name": "Kern",
                "given_name": "Andrew D.",
                "clpid": "Kern-Andrew-D"
            },
            {
                "family_name": "Dewey",
                "given_name": "Colin N.",
                "clpid": "Dewey-Colin-N"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "family_name": "Myers",
                "given_name": "Eugene",
                "clpid": "Myers-Eugene-W"
            },
            {
                "family_name": "Langley",
                "given_name": "Charles H.",
                "clpid": "Langley-Charles-H"
            }
        ],
        "abstract": "The population genetic perspective is that the processes shaping genomic variation can be revealed only through simultaneous investigation of sequence polymorphism and divergence within and between closely related species. Here we present a population genetic analysis of Drosophila simulans based on whole-genome shotgun sequencing of multiple inbred lines and comparison of the resulting data to genome assemblies of the closely related species, D. melanogaster and D. yakuba. We discovered previously unknown, large-scale fluctuations of polymorphism and divergence along chromosome arms, and significantly less polymorphism and faster divergence on the X chromosome. We generated a comprehensive list of functional elements in the D. simulans genome influenced by adaptive evolution. Finally, we characterized genomic patterns of base composition for coding and noncoding sequence. These results suggest several new hypotheses regarding the genetic and biological mechanisms controlling polymorphism and divergence across the Drosophila genome, and provide a rich resource for the investigation of adaptive evolution and functional variation in D. simulans.",
        "doi": "10.1371/journal.pbio.0050310",
        "pmcid": "PMC2062478",
        "issn": "1545-7885",
        "publisher": "Public Library of Science",
        "publication": "PLoS Biology",
        "publication_date": "2007-11",
        "series_number": "11",
        "volume": "5",
        "issue": "11",
        "pages": "Art. No. e310"
    },
    {
        "id": "authors:znssb-fyb07",
        "collection": "authors",
        "collection_id": "znssb-fyb07",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170307-094958898",
        "type": "article",
        "title": "Epistasis and Shapes of Fitness Landscapes",
        "author": [
            {
                "family_name": "Beerenwinkel",
                "given_name": "Niko",
                "clpid": "Beerenwinkel-N"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "family_name": "Sturmfels",
                "given_name": "Bernd",
                "clpid": "Sturmfels-B"
            }
        ],
        "abstract": "The relationship between the shape of a fitness landscape and the underlying gene interactions, or epistasis, has been extensively studied in the two-locus case. Gene interactions among multiple loci are usually reduced to two-way interactions. We present a geometric theory of shapes of fitness landscapes for multiple loci. A central concept is the genotope, which is the convex hull of all possible allele frequencies in populations. Triangulations of the genotope correspond to different shapes of fitness landscapes and reveal all the gene interactions. The theory is applied to fitness data from HIV and Drosophila melanogaster. In both cases, our findings refine earlier analyses and reveal previously undetected gene interactions.",
        "doi": "10.48550/arXiv.0603034",
        "issn": "1017-0405",
        "publisher": "Institute of Statistical Science, Academia Sinica",
        "publication": "Statistica Sinica",
        "publication_date": "2007-10",
        "series_number": "4",
        "volume": "17",
        "issue": "4",
        "pages": "1317-1342"
    },
    {
        "id": "authors:47a83-v6n34",
        "collection": "authors",
        "collection_id": "47a83-v6n34",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170306-154012533",
        "type": "article",
        "title": "The Cyclohedron Test for Finding Periodic Genes in Time Course Expression Studies",
        "author": [
            {
                "family_name": "Morton",
                "given_name": "Jason",
                "clpid": "Morton-J"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "family_name": "Shiu",
                "given_name": "Anne",
                "clpid": "Shiu-Anne"
            },
            {
                "family_name": "Sturmfels",
                "given_name": "Bernd",
                "clpid": "Sturmfels-B"
            }
        ],
        "abstract": "The problem of finding periodically expressed genes from time course microarray experiments is at the center of numerous efforts to identify the molecular components of biological clocks. We present a new approach to this problem based on the cyclohedron test, which is a rank test inspired by recent advances in algebraic combinatorics. The test has the advantage of being robust to measurement errors, and can be used to ascertain the significance of top-ranked genes. We apply the test to recently published measurements of gene expression during mouse somitogenesis and find 32 genes that collectively are significant. Among these are previously identified periodic genes involved in the Notch/FGF and Wnt signaling pathways, as well as novel candidate genes that may play a role in regulating the segmentation clock. These results confirm that there are an abundance of exceptionally periodic genes expressed during somitogenesis. The emphasis of this paper is on the statistics and combinatorics that underlie the cyclohedron test and its implementation within a multiple testing framework.",
        "doi": "10.2202/1544-6115.1286",
        "issn": "2194-6302",
        "publisher": "De Gruyter",
        "publication": "Statistical Applications in Genetics and Molecular Biology",
        "publication_date": "2007-08",
        "volume": "6",
        "pages": "Art. No. 21"
    },
    {
        "id": "authors:q89db-2cr95",
        "collection": "authors",
        "collection_id": "q89db-2cr95",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170307-130932831",
        "type": "article",
        "title": "Patterns of gene duplication and intron loss in the ENCODE regions suggest a confounding factor",
        "author": [
            {
                "family_name": "Chatterji",
                "given_name": "Sourav",
                "clpid": "Chatterji-S"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "The exon\u2013intron structure of eukaryotic genes allows for phenomena such as alternative splicing, nonsense-mediated decay, and regulation through untranslated regions. However, the evolution of the exon structure of genes is not well elucidated because of limited and phylogenetically sparse data sets. In this study, we use the phylogenetically diverse sequencing of the ENCODE regions to study gene structure evolution in mammalian genomes. This first phylogenetically diverse study of gene structure changes offers insights into the mode and tempo of mammalian gene structure evolution. The genes undergoing structure changes appear to be moderately to highly expressed in germline cells and show levels of selection similar to those of other ENCODE genes. Patterns of gene duplication of the affected genes are more complex than expected. The number of sampled genomes is sufficiently dense to infer that certain gene duplications happened after intron loss. Thus, although gene duplication is highly correlated with intron loss, we conclude that structural changes in genes are not necessarily due to a loss of constraint following gene duplication as previously suggested.",
        "doi": "10.1016/j.ygeno.2007.03.008",
        "pmcid": "PMC2034525",
        "issn": "0888-7543",
        "publisher": "Elsevier",
        "publication": "Genomics",
        "publication_date": "2007-07",
        "series_number": "1",
        "volume": "90",
        "issue": "1",
        "pages": "44-48"
    },
    {
        "id": "authors:0dpmy-1bc16",
        "collection": "authors",
        "collection_id": "0dpmy-1bc16",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170307-121815071",
        "type": "article",
        "title": "Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project",
        "author": [
            {
                "family_name": "Birney",
                "given_name": "Ewan",
                "clpid": "Birney-E"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "literal": "ENCODE Project Consortium"
            }
        ],
        "abstract": "We report the generation and analysis of functional data from multiple, diverse experiments performed on a targeted 1% of the human genome as part of the pilot phase of the ENCODE Project. These data have been further integrated and augmented by a number of evolutionary and computational analyses. Together, our results advance the collective knowledge about human genome function in several major areas. First, our studies provide convincing evidence that the genome is pervasively transcribed, such that the majority of its bases can be found in primary transcripts, including non-protein-coding transcripts, and those that extensively overlap one another. Second, systematic examination of transcriptional regulation has yielded new understanding about transcription start sites, including their relationship to specific regulatory sequences and features of chromatin accessibility and histone modification. Third, a more sophisticated view of chromatin structure has emerged, including its inter-relationship with DNA replication and transcriptional regulation. Finally, integration of these new sources of information, in particular with respect to mammalian evolution based on inter- and intra-species sequence comparisons, has yielded new mechanistic and evolutionary insights concerning the functional landscape of the human genome. Together, these studies are defining a path for pursuit of a more comprehensive characterization of human genome function.",
        "doi": "10.1038/nature05874",
        "pmcid": "PMC2212820",
        "issn": "0028-0836",
        "publisher": "Nature Publishing Group",
        "publication": "Nature",
        "publication_date": "2007-06-14",
        "series_number": "7146",
        "volume": "447",
        "issue": "7146",
        "pages": "799-816"
    },
    {
        "id": "authors:jz1bx-4kd75",
        "collection": "authors",
        "collection_id": "jz1bx-4kd75",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170307-125526948",
        "type": "article",
        "title": "Analyses of deep mammalian sequence alignments and constraint predictions for 1% of the human genome",
        "author": [
            {
                "family_name": "Margulies",
                "given_name": "Elliott H.",
                "clpid": "Margulies-E-H"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "literal": "ENCODE Project Consortium"
            }
        ],
        "abstract": "A key component of the ongoing ENCODE project involves rigorous comparative sequence analyses for the initially targeted 1% of the human genome. Here, we present orthologous sequence generation, alignment, and evolutionary constraint analyses of 23 mammalian species for all ENCODE targets. Alignments were generated using four different methods; comparisons of these methods reveal large-scale consistency but substantial differences in terms of small genomic rearrangements, sensitivity (sequence coverage), and specificity (alignment accuracy). We describe the quantitative and qualitative trade-offs concomitant with alignment method choice and the levels of technical error that need to be accounted for in applications that require multisequence alignments. Using the generated alignments, we identified constrained regions using three different methods. While the different constraint-detecting methods are in general agreement, there are important discrepancies relating to both the underlying alignments and the specific algorithms. However, by integrating the results across the alignments and constraint-detecting methods, we produced constraint annotations that were found to be robust based on multiple independent measures. Analyses of these annotations illustrate that most classes of experimentally annotated functional elements are enriched for constrained sequences; however, large portions of each class (with the exception of protein-coding sequences) do not overlap constrained regions. The latter elements might not be under primary sequence constraint, might not be constrained across all mammals, or might have expendable molecular functions. Conversely, 40% of the constrained sequences do not overlap any of the functional elements that have been experimentally identified. Together, these findings demonstrate and quantify how many genomic functional elements await basic molecular characterization.",
        "doi": "10.1101/gr.6034307",
        "pmcid": "PMC1891336",
        "issn": "1088-9051",
        "publisher": "Cold Spring Harbor Laboratory Press",
        "publication": "Genome Research",
        "publication_date": "2007-06",
        "series_number": "6",
        "volume": "17",
        "issue": "6",
        "pages": "760-774"
    },
    {
        "id": "authors:tmz64-b0883",
        "collection": "authors",
        "collection_id": "tmz64-b0883",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170307-131612759",
        "type": "article",
        "title": "Interpreting the unculturable majority",
        "author": [
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "New methods are necessary for the analysis and interpretation of massive amounts of metagenomic data.",
        "doi": "10.1038/nmeth0607-479",
        "issn": "1548-7091",
        "publisher": "Nature Publishing Group",
        "publication": "Nature Methods",
        "publication_date": "2007-06",
        "series_number": "6",
        "volume": "4",
        "issue": "6",
        "pages": "479-480"
    },
    {
        "id": "authors:1wmws-xht65",
        "collection": "authors",
        "collection_id": "1wmws-xht65",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170307-131928305",
        "type": "article",
        "title": "Analysis of epistatic interactions and fitness landscapes using a new geometric approach",
        "author": [
            {
                "family_name": "Beerenwinkel",
                "given_name": "Niko",
                "clpid": "Beerenwinkel-N"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "family_name": "Sturmfels",
                "given_name": "Bernd",
                "clpid": "Sturmfels-B"
            },
            {
                "family_name": "Elena",
                "given_name": "Santiago F.",
                "orcid": "0000-0001-8249-5593",
                "clpid": "Elena-S-F"
            },
            {
                "family_name": "Lenski",
                "given_name": "Richard E.",
                "orcid": "0000-0002-1064-8375",
                "clpid": "Lenski-R-E"
            }
        ],
        "abstract": "Background: Understanding interactions between mutations and how they affect fitness is a central problem in evolutionary biology that bears on such fundamental issues as the structure of fitness landscapes and the evolution of sex. To date, analyses of fitness landscapes have focused either on the overall directional curvature of the fitness landscape or on the distribution of pairwise interactions. In this paper, we propose and employ a new mathematical approach that allows a more complete description of multi-way interactions and provides new insights into the structure of fitness landscapes. \n\nResults: We apply the mathematical theory of gene interactions developed by Beerenwinkel et al. to a fitness landscape for Escherichia coli obtained by Elena and Lenski. The genotypes were constructed by introducing nine mutations into a wild-type strain and constructing a restricted set of 27 double mutants. Despite the absence of mutants higher than second order, our analysis of this genotypic space points to previously unappreciated gene interactions, in addition to the standard pairwise epistasis. Our analysis confirms Elena and Lenski's inference that the fitness landscape is complex, so that an overall measure of curvature obscures a diversity of interaction types. We also demonstrate that some mutations contribute disproportionately to this complexity. In particular, some mutations are systematically better than others at mixing with other mutations. We also find a strong correlation between epistasis and the average fitness loss caused by deleterious mutations. In particular, the epistatic deviations from multiplicative expectations tend toward more positive values in the context of more deleterious mutations, emphasizing that pairwise epistasis is a local property of the fitness landscape. Finally, we determine the geometry of the fitness landscape, which reflects many of these biologically interesting features. \n\nConclusion: A full description of complex fitness landscapes requires more information than the average curvature or the distribution of independent pairwise interactions. We have proposed a mathematical approach that, in principle, allows a complete description and, in practice, can suggest new insights into the structure of real fitness landscapes. Our analysis emphasizes the value of non-independent genotypes for these inferences.",
        "doi": "10.1186/1471-2148-7-60",
        "pmcid": "PMC1865543",
        "issn": "1471-2148",
        "publisher": "BioMed Central",
        "publication": "BMC Evolutionary Biology",
        "publication_date": "2007-04-13",
        "volume": "7",
        "pages": "Art. No. 60"
    },
    {
        "id": "authors:737t0-a8613",
        "collection": "authors",
        "collection_id": "737t0-a8613",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170307-085456375",
        "type": "article",
        "title": "The Mathematics of Phylogenomics",
        "author": [
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "family_name": "Sturmfels",
                "given_name": "Bernd",
                "clpid": "Sturmfels-B"
            }
        ],
        "abstract": "The grand challenges in biology today are being shaped by powerful high\u2010throughput technologies that have revealed the genomes of many organisms, global expression patterns of genes, and detailed information about variation within populations. We are therefore able to ask, for the first time, fundamental questions about the evolution of genomes, the structure of genes and their regulation, and the connections between genotypes and phenotypes of individuals. The answers to these questions are all predicated on progress in a variety of computational, statistical, and mathematical fields. The rapid growth in the characterization of genomes has led to the advancement of a new discipline called phylogenomics. This discipline results from the combination of two major fields in the life sciences: genomics, i.e., the study of the function and structure of genes and genomes; and molecular phylogenetics, i.e., the study of the hierarchical evolutionary relationships among organisms and their genomes. The objective of this article is to offer mathematicians a first introduction to this emerging field, and to discuss specific mathematical problems and developments arising from phylogenomics.",
        "doi": "10.1137/050632634",
        "issn": "0036-1445",
        "publisher": "Society for Industrial and Applied Mathematics",
        "publication": "SIAM Review",
        "publication_date": "2007-01-30",
        "series_number": "1",
        "volume": "49",
        "issue": "1",
        "pages": "3-31"
    },
    {
        "id": "authors:pmyqk-7z667",
        "collection": "authors",
        "collection_id": "pmyqk-7z667",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170307-134453947",
        "type": "article",
        "title": "Multiple alignment by sequence annealing",
        "author": [
            {
                "family_name": "Schwartz",
                "given_name": "Ariel S.",
                "clpid": "Schwartz-A-S"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "MOTIVATION: We introduce a novel approach to multiple alignment that is based on an algorithm for rapidly checking whether single matches are consistent with a partial multiple alignment. This leads to a sequence annealing algorithm, which is an incremental method for building multiple sequence alignments one match at a time. Our approach improves significantly on the standard progressive alignment approach to multiple alignment. \n\nRESULTS: The sequence annealing algorithm performs well on benchmark test sets of protein sequences. It is not only sensitive, but also specific, drastically reducing the number of incorrectly aligned residues in comparison to other programs. The method allows for adjustment of the sensitivity/specificity tradeoff and can be used to reliably identify homologous regions among protein sequences. \n\nAVAILABILITY: An implementation of the sequence annealing algorithm is available at http://bio.math.berkeley.edu/amap/",
        "doi": "10.1093/bioinformatics/btl311",
        "issn": "1367-4803",
        "publisher": "Oxford University Press",
        "publication": "Bioinformatics",
        "publication_date": "2007-01-15",
        "series_number": "2",
        "volume": "23",
        "issue": "2",
        "pages": "e24-e29"
    },
    {
        "id": "authors:kk2y6-qk074",
        "collection": "authors",
        "collection_id": "kk2y6-qk074",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170307-090954418",
        "type": "article",
        "title": "Parametric Alignment of Drosophila Genomes",
        "author": [
            {
                "family_name": "Dewey",
                "given_name": "Colin N.",
                "clpid": "Dewey-C-N"
            },
            {
                "family_name": "Huggins",
                "given_name": "Peter M.",
                "clpid": "Huggins-P-M"
            },
            {
                "family_name": "Woods",
                "given_name": "Kevin",
                "clpid": "Woods-K"
            },
            {
                "family_name": "Sturmfels",
                "given_name": "Bernd",
                "clpid": "Sturmfels-B"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "The classic algorithms of Needleman\u2013Wunsch and Smith\u2013Waterman find a maximum a posteriori probability alignment for a pair hidden Markov model (PHMM). To process large genomes that have undergone complex genome rearrangements, almost all existing whole genome alignment methods apply fast heuristics to divide genomes into small pieces that are suitable for Needleman\u2013Wunsch alignment. In these alignment methods, it is standard practice to fix the parameters and to produce a single alignment for subsequent analysis by biologists. As the number of alignment programs applied on a whole genome scale continues to increase, so does the disagreement in their results. The alignments produced by different programs vary greatly, especially in non-coding regions of eukaryotic genomes where the biologically correct alignment is hard to find. Parametric alignment is one possible remedy. This methodology resolves the issue of robustness to changes in parameters by finding all optimal alignments for all possible parameters in a PHMM. Our main result is the construction of a whole genome parametric alignment of Drosophila melanogaster and Drosophila pseudoobscura. This alignment draws on existing heuristics for dividing whole genomes into small pieces for alignment, and it relies on advances we have made in computing convex polytopes that allow us to parametrically align non-coding regions using biologically realistic models. We demonstrate the utility of our parametric alignment for biological inference by showing that cis-regulatory elements are more conserved between Drosophila melanogaster and Drosophila pseudoobscura than previously thought. We also show how whole genome parametric alignment can be used to quantitatively assess the dependence of branch length estimates on alignment parameters.",
        "doi": "10.1371/journal.pcbi.0020073",
        "pmcid": "PMC1480539",
        "issn": "1553-734X",
        "publisher": "Public Library of Science",
        "publication": "PLoS Computational Biology",
        "publication_date": "2006-06",
        "series_number": "6",
        "volume": "2",
        "issue": "6",
        "pages": "Art. No. e73"
    },
    {
        "id": "authors:9rzxt-bad89",
        "collection": "authors",
        "collection_id": "9rzxt-bad89",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170307-162320251",
        "type": "article",
        "title": "Evolution at the nucleotide level: the problem of multiple whole-genome alignment",
        "author": [
            {
                "family_name": "Dewey",
                "given_name": "Colin N.",
                "clpid": "Dewey-C-N"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "With the genome sequences of numerous species at hand, we have the opportunity to discover how evolution has acted at each and every nucleotide in our genome. To this end, we must identify sets of nucleotides that have descended from a common ancestral nucleotide. The problem of identifying evolutionary-related nucleotides is that of sequence alignment. When the sequences under consideration are entire genomes, we have the problem of multiple whole-genome alignment. In this paper, we first state a series of definitions for homology and its subrelations between single nucleotides. Within this framework, we review the current methods available for the alignment of multiple large genomes. We then describe a subset of tools that make biological inferences from multiple whole-genome alignments.",
        "doi": "10.1093/hmg/ddl056",
        "issn": "0964-6906",
        "publisher": "Oxford University Press",
        "publication": "Human Molecular Genetics",
        "publication_date": "2006-04-15",
        "series_number": "Suppl. 1",
        "volume": "15",
        "issue": "Suppl. 1",
        "pages": "R51-R56"
    },
    {
        "id": "authors:4g2cj-jb389",
        "collection": "authors",
        "collection_id": "4g2cj-jb389",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170307-162728814",
        "type": "article",
        "title": "Reference based annotation with GeneMapper",
        "author": [
            {
                "family_name": "Chatterji",
                "given_name": "Sourav",
                "clpid": "Chatterji-S"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "We introduce GeneMapper, a program for transferring annotations from a well annotated genome to other genomes. Drawing on high quality curated annotations, GeneMapper enables rapid and accurate annotation of newly sequenced genomes and is suitable for both finished and draft genomes. GeneMapper uses a profile based approach for mapping genes into multiple species, improving upon the standard pairwise approach. GeneMapper is freely available for academic use.",
        "doi": "10.1186/gb-2006-7-4-r29",
        "pmcid": "PMC1557983",
        "issn": "1465-6906",
        "publisher": "BioMed Central",
        "publication": "Genome Biology",
        "publication_date": "2006-04-05",
        "series_number": "4",
        "volume": "7",
        "issue": "4",
        "pages": "Art. No. R29"
    },
    {
        "id": "authors:rfdxr-1as30",
        "collection": "authors",
        "collection_id": "rfdxr-1as30",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170307-164849681",
        "type": "article",
        "title": "A Genome-Wide Map of Conserved MicroRNA Targets in C. elegans",
        "author": [
            {
                "family_name": "Lall",
                "given_name": "Sabbi",
                "clpid": "Lall-S"
            },
            {
                "family_name": "Gr\u00fcn",
                "given_name": "Dominic",
                "clpid": "Gr\u00fcn-D"
            },
            {
                "family_name": "Krek",
                "given_name": "Azra",
                "clpid": "Krek-A"
            },
            {
                "family_name": "Chen",
                "given_name": "Kevin",
                "clpid": "Chen-Kevin-Bio"
            },
            {
                "family_name": "Wang",
                "given_name": "Yi-Lu",
                "clpid": "Wang-Yi-Lu"
            },
            {
                "family_name": "Dewey",
                "given_name": "Colin N.",
                "clpid": "Dewey-C-N"
            },
            {
                "family_name": "Sood",
                "given_name": "Pranidhi",
                "clpid": "Sood-P"
            },
            {
                "family_name": "Colombo",
                "given_name": "Teresa",
                "clpid": "Colombo-T"
            },
            {
                "family_name": "Bray",
                "given_name": "Nicolas",
                "clpid": "Bray-N-L"
            },
            {
                "family_name": "MacMenamin",
                "given_name": "Philip",
                "clpid": "MacMenamin-P"
            },
            {
                "family_name": "Kao",
                "given_name": "Huey-Ling",
                "clpid": "Kao-Huey-Ling"
            },
            {
                "family_name": "Gunsalus",
                "given_name": "Kristin C.",
                "clpid": "Gunsalus-K-C"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "family_name": "Piano",
                "given_name": "Fabio",
                "clpid": "Piano-F"
            },
            {
                "family_name": "Rajewsky",
                "given_name": "Nikolaus",
                "clpid": "Rajewsky-N"
            }
        ],
        "abstract": "Background: Metazoan miRNAs regulate protein-coding genes by binding the 3\u2032 UTR of cognate mRNAs. Identifying targets for the 115 known C. elegans miRNAs is essential for understanding their function. \n\nResults: By using a new version of PicTar and sequence alignments of three nematodes, we predict that miRNAs regulate at least 10% of C. elegans genes through conserved interactions. We have developed a new experimental pipeline to assay 3\u2032 UTR-mediated posttranscriptional gene regulation via an endogenous reporter expression system amenable to high-throughput cloning, demonstrating the utility of this system using one of the most intensely studied miRNAs, let-7. Our expression analyses uncover several new potential let-7 targets and suggest a new let-7 activity in head muscle and neurons. To explore genome-wide trends in miRNA function, we analyzed functional categories of predicted target genes, finding that one-third of C. elegans miRNAs target gene sets are enriched for specific functional annotations. We have also integrated miRNA target predictions with other functional genomic data from C. elegans. \n\nConclusions: At least 10% of C. elegans genes are predicted miRNA targets, and a number of nematode miRNAs seem to regulate biological processes by targeting functionally related genes. We have also developed and successfully utilized an in vivo system for testing miRNA target predictions in likely endogenous expression domains. The thousands of genome-wide miRNA target predictions for nematodes, humans, and flies are available from the PicTar website and are linked to an accessible graphical network-browsing tool allowing exploration of miRNA target predictions in the context of various functional genomic data resources.",
        "doi": "10.1016/j.cub.2006.01.050",
        "issn": "0960-9822",
        "publisher": "Cell Press",
        "publication": "Current Biology",
        "publication_date": "2006-03-07",
        "series_number": "5",
        "volume": "16",
        "issue": "5",
        "pages": "460-471"
    },
    {
        "id": "authors:1hgt0-34a86",
        "collection": "authors",
        "collection_id": "1hgt0-34a86",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170307-164418033",
        "type": "article",
        "title": "Beyond Pairwise Distances: Neighbor-Joining with Phylogenetic Diversity Estimates",
        "author": [
            {
                "family_name": "Levy",
                "given_name": "Dan",
                "clpid": "Levy-D"
            },
            {
                "family_name": "Yoshida",
                "given_name": "Ruriko",
                "clpid": "Yoshida-Ruriko"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "The \"neighbor-joining algorithm\" is a recursive procedure for reconstructing trees that is based on a transformation of pairwise distances between leaves. We present a generalization of the neighbor-joining transformation, which uses estimates of phylogenetic diversity rather than pairwise distances in the tree. This leads to an improved neighbor-joining algorithm whose total running time is still polynomial in the number of taxa. On simulated data, the method outperforms other distance-based methods. We have implemented neighbor-joining for subtree weights in a program called MJOIN which is freely available under the Gnu Public License at http://bio.math.berkeley.edu/mjoin/.",
        "doi": "10.1093/molbev/msj059",
        "issn": "0737-4038",
        "publisher": "Oxford University Press",
        "publication": "Molecular Biology and Evolution",
        "publication_date": "2006-03",
        "series_number": "3",
        "volume": "23",
        "issue": "3",
        "pages": "491-498"
    },
    {
        "id": "authors:y9ss2-0wq61",
        "collection": "authors",
        "collection_id": "y9ss2-0wq61",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170308-111109017",
        "type": "article",
        "title": "Identification of transposable elements using multiple alignments of related genomes",
        "author": [
            {
                "family_name": "Caspi",
                "given_name": "Anat",
                "orcid": "0000-0001-8702-8273",
                "clpid": "Caspi-A"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "Accurate genome-wide cataloging of transposable elements (TEs) will facilitate our understanding of mobile DNA evolution, expose the genomic effects of TEs on the host genome, and improve the quality of assembled genomes. Using the availability of several nearly complete Drosophila genomes and developments in whole genome alignment methods, we introduce a large-scale comparative method for identifying repetitive mobile DNA regions. These regions are highly enriched for transposable elements. Our method has two main features distinguishing it from other repeat-finding methods. First, rather than relying on sequence similarity to determine the location of repeats, the genomic artifacts of the transposition mechanism itself are systematically tracked in the context of multiple alignments. Second, we can derive bounds on the age of each repeat instance based on the phylogenetic species tree. We report results obtained using both complete and draft sequences of four closely related Drosophila genomes and validate our results with manually curated TE annotations in the Drosophila melanogaster euchromatin. We show the utility of our findings in exploring both transposable elements and their host genomes: In the study of TEs, we offer predictions for novel families, annotate new insertions of known families, and show data that support the hypothesis that all known TE families in D. melanogaster were recently active; in the study of the host, we show how our findings can be used to determine shifts in the eu-heterochromatin junction in the pericentric chromosome regions.",
        "doi": "10.1101/gr.4361206",
        "pmcid": "PMC1361722",
        "issn": "1088-9051",
        "publisher": "Cold Spring Harbor Laboratory Press",
        "publication": "Genome Research",
        "publication_date": "2006-02",
        "series_number": "2",
        "volume": "16",
        "issue": "2",
        "pages": "260-270"
    },
    {
        "id": "authors:yf5ce-tav61",
        "collection": "authors",
        "collection_id": "yf5ce-tav61",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170308-125857130",
        "type": "article",
        "title": "Large Multiple Organism Gene Finding by Collapsed Gibbs Sampling",
        "author": [
            {
                "family_name": "Chatterji",
                "given_name": "Sourav",
                "clpid": "Chatterji-S"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "The Gibbs sampling method has been widely used for sequence analysis after it was successfully applied to the problem of identifying regulatory motif sequences upstream of genes. Since then, numerous variants of the original idea have emerged: however, in all cases the application has been to finding short motifs in collections of short sequences (typically less than 100 nucleotides long). In this paper, we introduce a Gibbs sampling approach for identifying genes in multiple large genomic sequences up to hundreds of kilobases long. This approach leverages the evolutionary relationships between the sequences to improve the gene predictions, without explicitly aligning the sequences. We have applied our method to the analysis of genomic sequence from 14 genomic regions, totaling roughly 1.8 Mb of sequence in each organism. We show that our approach compares favorably with existing ab initio approaches to gene finding, including pairwise comparison based gene prediction methods which make explicit use of alignments. Furthermore, excellent performance can be obtained with as little as four organisms, and the method overcomes a number of difficulties of previous comparison based gene finding approaches: it is robust with respect to genomic rearrangements, can work with draft sequence, and is fast (linear in the number and length of the sequences). It can also be seamlessly integrated with Gibbs sampling motif detection methods.",
        "doi": "10.1089/cmb.2005.12.599",
        "issn": "1066-5277",
        "publisher": "Mary Ann Liebert, Inc.",
        "publication": "Journal of Computational Biology",
        "publication_date": "2005-07",
        "series_number": "6",
        "volume": "12",
        "issue": "6",
        "pages": "599-608"
    },
    {
        "id": "authors:czhmv-7p908",
        "collection": "authors",
        "collection_id": "czhmv-7p908",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170308-124940796",
        "type": "article",
        "title": "Bioinformatics for Whole-Genome Shotgun Sequencing of Microbial Communities",
        "author": [
            {
                "family_name": "Chen",
                "given_name": "Kevin",
                "clpid": "Chen-Kevin-Bio"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "The application of whole-genome shotgun sequencing to microbial communities represents a major development in metagenomics, the study of uncultured microbes via the tools of modern genomic analysis. In the past year, whole-genome shotgun sequencing projects of prokaryotic communities from an acid mine biofilm, the Sargasso Sea, Minnesota farm soil, three deep-sea whale falls, and deep-sea sediments have been reported, adding to previously published work on viral communities from marine and fecal samples. The interpretation of this new kind of data poses a wide variety of exciting and difficult bioinformatics problems. The aim of this review is to introduce the bioinformatics community to this emerging field by surveying existing techniques and promising new approaches for several of the most interesting of these computational problems.",
        "doi": "10.1371/journal.pcbi.0010024",
        "pmcid": "PMC1185649",
        "issn": "1553-734X",
        "publisher": "Public Library of Science",
        "publication": "PLoS Computational Biology",
        "publication_date": "2005-07",
        "series_number": "2",
        "volume": "1",
        "issue": "2",
        "pages": "Art. No. e24"
    },
    {
        "id": "authors:dejyt-dqe25",
        "collection": "authors",
        "collection_id": "dejyt-dqe25",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20190503-150942109",
        "type": "article",
        "title": "Subtree power analysis and species selection for comparative genomics",
        "author": [
            {
                "family_name": "McAuliffe",
                "given_name": "Jon D.",
                "clpid": "McAuliffe-J-D"
            },
            {
                "family_name": "Jordan",
                "given_name": "Michael I.",
                "clpid": "Jordan-M-I"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "Sequence comparison across multiple organisms aids in the detection of regions under selection. However, resource limitations require a prioritization of genomes to be sequenced. This prioritization should be grounded in two considerations: the lineal scope encompassing the biological phenomena of interest, and the optimal species within that scope for detecting functional elements. We introduce a statistical framework for optimal species subset selection, based on maximizing power to detect conserved sites. Analysis of a phylogenetic star topology shows theoretically that the optimal species subset is not in general the most evolutionarily diverged subset. We then demonstrate this finding empirically in a study of vertebrate species. Our results suggest that marsupials are prime sequencing candidates.",
        "doi": "10.1073/pnas.0502790102",
        "pmcid": "PMC1142384",
        "issn": "0027-8424",
        "publisher": "National Academy of Sciences",
        "publication": "Proceedings of the National Academy of Sciences of the United States of America",
        "publication_date": "2005-05-31",
        "series_number": "22",
        "volume": "102",
        "issue": "22",
        "pages": "7900-7905"
    },
    {
        "id": "authors:nbtgs-bmj74",
        "collection": "authors",
        "collection_id": "nbtgs-bmj74",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170307-090057219",
        "type": "article",
        "title": "Subtree power analysis finds optimal species for comparative genomics",
        "author": [
            {
                "family_name": "McAuliffe",
                "given_name": "Jon D.",
                "clpid": "McAuliffe-J-D"
            },
            {
                "family_name": "Jordan",
                "given_name": "Michael I.",
                "clpid": "Jordan-M-I"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "Sequence comparison across multiple organisms aids in the detection of regions under selection. However, resource limitations require a prioritization of genomes to be sequenced. This prioritization should be grounded in two considerations: the lineal scope encompassing the biological phenomena of interest, and the optimal species within that scope for detecting functional elements. We introduce a statistical framework for optimal species subset selection, based on maximizing power to detect conserved sites. Analysis of a phylogenetic star topology shows theoretically that the optimal species subset is not in general the most evolutionarily diverged subset. We then demonstrate this finding empirically in a study of vertebrate species. Our results suggest that marsupials are prime sequencing candidates.",
        "doi": "10.1073/pnas.0502790102",
        "pmcid": "PMC1142384",
        "issn": "0027-8424",
        "publisher": "National Academy of Sciences",
        "publication": "Proceedings of the National Academy of Sciences of the United States of America",
        "publication_date": "2005-05-31",
        "series_number": "22",
        "volume": "102",
        "issue": "22",
        "pages": "7900-7905"
    },
    {
        "id": "authors:tr080-6rv26",
        "collection": "authors",
        "collection_id": "tr080-6rv26",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170308-130340353",
        "type": "article",
        "title": "Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution",
        "author": [
            {
                "family_name": "Hillier",
                "given_name": "LaDeana W.",
                "clpid": "Hillier-L-W"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "literal": "International Chicken Genome Sequencing Consortium"
            }
        ],
        "abstract": "We present here a draft genome sequence of the red jungle fowl, Gallus gallus. Because the chicken is a modern descendant of the dinosaurs and the first non-mammalian amniote to have its genome sequenced, the draft sequence of its genome\u2014composed of approximately one billion base pairs of sequence and an estimated 20,000\u201323,000 genes\u2014provides a new perspective on vertebrate genome evolution, while also improving the annotation of mammalian genomes. For example, the evolutionary distance between chicken and human provides high specificity in detecting functional elements, both non-coding and coding. Notably, many conserved non-coding sequences are far from genes and cannot be assigned to defined functional classes. In coding regions the evolutionary dynamics of protein domains and orthologous groups illustrate processes that distinguish the lineages leading to birds and mammals. The distinctive properties of avian microchromosomes, together with the inferred patterns of conserved synteny, provide additional insights into vertebrate chromosome architecture.",
        "doi": "10.1038/nature03154",
        "issn": "0028-0836",
        "publisher": "Nature Publishing Group",
        "publication": "Nature",
        "publication_date": "2004-12-09",
        "series_number": "7018",
        "volume": "432",
        "issue": "7018",
        "pages": "695-716"
    },
    {
        "id": "authors:q9x6k-rkc63",
        "collection": "authors",
        "collection_id": "q9x6k-rkc63",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170308-131651119",
        "type": "article",
        "title": "Intraspecies sequence comparisons for annotating genomes",
        "author": [
            {
                "family_name": "Boffelli",
                "given_name": "Dario",
                "clpid": "Boffelli-D"
            },
            {
                "family_name": "Weer",
                "given_name": "Claire V.",
                "clpid": "Weer-C-V"
            },
            {
                "family_name": "Weng",
                "given_name": "Li",
                "clpid": "Weng-Li"
            },
            {
                "family_name": "Lewis",
                "given_name": "Keith D.",
                "clpid": "Lewis-K-D"
            },
            {
                "family_name": "Shoukry",
                "given_name": "Malak I.",
                "clpid": "Shoukry-M"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "family_name": "Keys",
                "given_name": "David N.",
                "clpid": "Keys-D-N"
            },
            {
                "family_name": "Rubin",
                "given_name": "Edward M.",
                "clpid": "Rubin-E-M"
            }
        ],
        "abstract": "Analysis of sequence variation among members of a single species offers a potential approach to identify functional DNA elements responsible for biological features unique to that species. Due to its high rate of allelic polymorphism and ease of genetic manipulability, we chose the sea squirt, Ciona intestinalis, to explore intraspecies sequence comparisons for genome annotation. A large number of C. intestinalis specimens were collected from four continents, and a set of genomic intervals were amplified, resequenced, and analyzed to determine the mutation rates at each nucleotide in the sequence. We found that regions with low mutation rates efficiently demarcated functionally constrained sequences: these include a set of noncoding elements, which we showed in C. intestinalis transgenic assays to act as tissue-specific enhancers, as well as the location of coding sequences. This illustrates that comparisons of multiple members of a species can be used for genome annotation, suggesting a path for the annotation of the sequenced genomes of organisms occupying uncharacterized phylogenetic branches of the animal kingdom. It also raises the possibility that the resequencing of a large number of Homo sapiens individuals might be used to annotate the human genome and identify sequences defining traits unique to our species.",
        "doi": "10.1101/gr.3199704",
        "pmcid": "PMC534664",
        "issn": "1088-9051",
        "publisher": "Cold Spring Harbor Laboratory Press",
        "publication": "Genome Research",
        "publication_date": "2004-12",
        "series_number": "12",
        "volume": "14",
        "issue": "12",
        "pages": "2406-2411"
    },
    {
        "id": "authors:hn6be-bya11",
        "collection": "authors",
        "collection_id": "hn6be-bya11",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170307-081738298",
        "type": "article",
        "title": "Parametric Inference for Biological Sequence Analysis",
        "author": [
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "family_name": "Sturmfels",
                "given_name": "Bernd",
                "clpid": "Sturmfels-B"
            }
        ],
        "abstract": "One of the major successes in computational biology has been the unification, by using the graphical model formalism, of a multitude of algorithms for annotating and comparing biological sequences. Graphical models that have been applied to these problems include hidden Markov models for annotation, tree models for phylogenetics, and pair hidden Markov models for alignment. A single algorithm, the sum-product algorithm, solves many of the inference problems that are associated with different statistical models. This article introduces the polytope propagation algorithm for computing the Newton polytope of an observation from a graphical model. This algorithm is a geometric version of the sum-product algorithm and is used to analyze the parametric behavior of maximum a posteriori inference calculations for graphical models.",
        "doi": "10.1073/pnas.0406011101",
        "pmcid": "PMC528961",
        "issn": "0027-8424",
        "publisher": "National Academy of Sciences",
        "publication": "Proceedings of the National Academy of Sciences of the United States of America",
        "publication_date": "2004-11-16",
        "series_number": "46",
        "volume": "101",
        "issue": "46",
        "pages": "16138-16143"
    },
    {
        "id": "authors:9npck-fag88",
        "collection": "authors",
        "collection_id": "9npck-fag88",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170307-073504137",
        "type": "article",
        "title": "Tropical Geometry of Statistical Models",
        "author": [
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "family_name": "Sturmfels",
                "given_name": "Bernd",
                "clpid": "Sturmfels-B"
            }
        ],
        "abstract": "This article presents a unified mathematical framework for inference in graphical models, building on the observation that graphical models are algebraic varieties. From this geometric viewpoint, observations generated from a model are coordinates of a point in the variety, and the sum-product algorithm is an efficient tool for evaluating specific coordinates. Here, we address the question of how the solutions to various inference problems depend on the model parameters. The proposed answer is expressed in terms of tropical algebraic geometry. The Newton polytope of a statistical model plays a key role. Our results are applied to the hidden Markov model and the general Markov model on a binary tree.",
        "doi": "10.1073/pnas.0406010101",
        "pmcid": "PMC528960",
        "issn": "0027-8424",
        "publisher": "National Academy of Sciences",
        "publication": "Proceedings of the National Academy of Sciences of the United States of America",
        "publication_date": "2004-11-16",
        "series_number": "46",
        "volume": "101",
        "issue": "46",
        "pages": "16132-16137"
    },
    {
        "id": "authors:h1p73-wax92",
        "collection": "authors",
        "collection_id": "h1p73-wax92",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170308-133717579",
        "type": "article",
        "title": "The ENCODE (ENCyclopedia Of DNA Elements) Project",
        "author": [
            {
                "family_name": "Feingold",
                "given_name": "E. A.",
                "clpid": "Feingold-E-A"
            },
            {
                "family_name": "Pachter",
                "given_name": "L.",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "literal": "ENCODE Project Consortium"
            }
        ],
        "abstract": "The ENCyclopedia Of DNA Elements (ENCODE) Project aims to identify all functional elements in the human genome sequence. The pilot phase of the Project is focused on a specified 30 megabases (\u223c1%) of the human genome sequence and is organized as an international consortium of computational and laboratory-based scientists working to develop and apply high-throughput approaches for detecting all sequence elements that confer biological function. The results of this pilot phase will guide future efforts to analyze the entire human genome.",
        "doi": "10.1126/science.1105136",
        "issn": "0036-8075",
        "publisher": "American Association for the Advancement of Science",
        "publication": "Science",
        "publication_date": "2004-10-22",
        "series_number": "5696",
        "volume": "306",
        "issue": "5696",
        "pages": "636-640"
    },
    {
        "id": "authors:h1v4b-fzh55",
        "collection": "authors",
        "collection_id": "h1v4b-fzh55",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170308-135943475",
        "type": "article",
        "title": "Multiple-sequence functional annotation and the generalized hidden Markov phylogeny",
        "author": [
            {
                "family_name": "McAuliffe",
                "given_name": "Jon D.",
                "clpid": "McAuliffe-J-D"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "family_name": "Jordan",
                "given_name": "Michael I.",
                "clpid": "Jordan-M-I"
            }
        ],
        "abstract": "Motivation: Phylogenetic shadowing is a comparative genomics principle that allows for the discovery of conserved regions in sequences from multiple closely related organisms. We develop a formal probabilistic framework for combining phylogenetic shadowing with feature-based functional annotation methods. The resulting model, a generalized hidden Markov phylogeny (GHMP), applies to a variety of situations where functional regions are to be inferred from evolutionary constraints. \n\nResults: We show how GHMPs can be used to predict complete shared gene structures in multiple primate sequences. We also describe shadower, our implementation of such a prediction system. We find that shadower outperforms previously reported ab initio gene finders, including comparative human\u2013mouse approaches, on a small sample of diverse exonic regions. Finally, we report on an empirical analysis of shadower's performance which reveals that as few as five well-chosen species may suffice to attain maximal sensitivity and specificity in exon demarcation. \n\nAvailability: A Web server is available at http://bonaire.lbl.gov/shadower",
        "doi": "10.1093/bioinformatics/bth153",
        "issn": "1367-4803",
        "publisher": "Oxford University Press",
        "publication": "Bioinformatics",
        "publication_date": "2004-08-12",
        "series_number": "12",
        "volume": "20",
        "issue": "12",
        "pages": "1850-1860"
    },
    {
        "id": "authors:w4ajn-sw675",
        "collection": "authors",
        "collection_id": "w4ajn-sw675",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170308-140652072",
        "type": "article",
        "title": "VISTA: computational tools for comparative genomics",
        "author": [
            {
                "family_name": "Frazer",
                "given_name": "Kelly A.",
                "clpid": "Frazier-K-A"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "family_name": "Poliakov",
                "given_name": "Alexander",
                "clpid": "Poliakov-A-N-B"
            },
            {
                "family_name": "Rubin",
                "given_name": "Edward M.",
                "clpid": "Rubin-E-M"
            },
            {
                "family_name": "Dubchak",
                "given_name": "Inna",
                "clpid": "Dubchak-I"
            }
        ],
        "abstract": "Comparison of DNA sequences from different species is a fundamental method for identifying functional elements in genomes. Here, we describe the VISTA family of tools created to assist biologists in carrying out this task. Our first VISTA server at http://www-gsd.lbl.gov/vista/ was launched in the summer of 2000 and was designed to align long genomic sequences and visualize these alignments with associated functional annotations. Currently the VISTA site includes multiple comparative genomics tools and provides users with rich capabilities to browse pre-computed whole-genome alignments of large vertebrate genomes and other groups of organisms with VISTA Browser, to submit their own sequences of interest to several VISTA servers for various types of comparative analysis and to obtain detailed comparative analysis results for a set of cardiovascular genes. We illustrate capabilities of the VISTA site by the analysis of a 180 kb interval on human chromosome 5 that encodes the kinesin family member 3A (KIF3A) protein.",
        "doi": "10.1093/nar/gkh458",
        "pmcid": "PMC441596",
        "issn": "0305-1048",
        "publisher": "Oxford University Press",
        "publication": "Nucleic Acids Research",
        "publication_date": "2004-07-01",
        "series_number": "Suppl. 2",
        "volume": "32",
        "issue": "Suppl. 2",
        "pages": "W273-W279"
    },
    {
        "id": "authors:2v174-rtt49",
        "collection": "authors",
        "collection_id": "2v174-rtt49",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170307-080948323",
        "type": "article",
        "title": "Reconstructing Trees from Subtree Weights",
        "author": [
            {
                "family_name": "Pachter",
                "given_name": "L.",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "family_name": "Speyer",
                "given_name": "D.",
                "clpid": "Speyer-D"
            }
        ],
        "abstract": "The tree-metric theorem provides a necessary and sufficient condition for a dissimilarity matrix to be a tree metric, and has served as the foundation for numerous distance-based reconstruction methods in phylogenetics. Our main result is an extension of the tree-metric theorem to more general dissimilarity maps. In particular, we show that a tree with n leaves is reconstructible from the weights of the m-leaf subtrees provided that n \u2265 2m - 1.",
        "doi": "10.1016/S0893-9659(04)90095-X",
        "issn": "0893-9659",
        "publisher": "Elsevier",
        "publication": "Applied Mathematics Letters",
        "publication_date": "2004-06",
        "series_number": "6",
        "volume": "17",
        "issue": "6",
        "pages": "615-621"
    },
    {
        "id": "authors:t95d7-66x59",
        "collection": "authors",
        "collection_id": "t95d7-66x59",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170308-145138306",
        "type": "article",
        "title": "Genome sequence of the Brown Norway rat yields insights into mammalian evolution",
        "author": [
            {
                "family_name": "Gibbs",
                "given_name": "Richard A.",
                "clpid": "Gibbs-R-A"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "literal": "Rat Genome Sequencing Project Consortium"
            }
        ],
        "abstract": "The laboratory rat (Rattus norvegicus) is an indispensable tool in experimental medicine and drug development, having made inestimable contributions to human health. We report here the genome sequence of the Brown Norway (BN) rat strain. The sequence represents a high-quality 'draft' covering over 90% of the genome. The BN rat sequence is the third complete mammalian genome to be deciphered, and three-way comparisons with the human and mouse genomes resolve details of mammalian evolution. This first comprehensive analysis includes genes and proteins and their relation to human disease, repeated sequences, comparative genome-wide studies of mammalian orthologous chromosomal regions and rearrangement breakpoints, reconstruction of ancestral karyotypes and the events leading to existing species, rates of variation, and lineage-specific and lineage-independent evolutionary events such as expansion of gene families, orthology relations and protein evolution.",
        "doi": "10.1038/nature02426",
        "issn": "0028-0836",
        "publisher": "Nature Publishing Group",
        "publication": "Nature",
        "publication_date": "2004-04-01",
        "series_number": "6982",
        "volume": "428",
        "issue": "6982",
        "pages": "493-521"
    },
    {
        "id": "authors:vf6w0-hrb56",
        "collection": "authors",
        "collection_id": "vf6w0-hrb56",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170308-144803591",
        "type": "article",
        "title": "Identification of Evolutionary Hotspots in the Rodent Genomes",
        "author": [
            {
                "family_name": "Yap",
                "given_name": "Von Bing",
                "clpid": "Yap-Von-Bing"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "We describe a whole-genome comparative analysis of the human, mouse, and rat genomes to describe the average substitution patterns of four genomic regions: ancient repeats, rodent-specific DNA, exons, and conserved (coding and noncoding) regions, and to identify rodent evolutionary hotspots. In all types of regions, except the rodent-specific DNA, the rat branch is slightly longer than the mouse branch. Moreover, the mouse\u2013rat distance is longer in the rodent-specific DNA than in the ancient repeats. Analysis of individual conserved regions with different substitution models yielded the conclusion that the Jukes\u2013Cantor model is inadequate, and the Hasegawa\u2013Kishino\u2013Yano model is almost as good as the REV model. Using human as an outgroup, we identified 5055 evolutionary hotspots, which are highly conserved subalignment blocks (each consisting of at least 100 aligned sites and a small fraction of gaps) with a large and statistically significant difference in the branch lengths of the rodent species. The cutoffs used to identify the hotspots are partially based on estimates of the average rates of substitution. The fractions of hotspots overlapping with the rodent RefSeq genes, RefSeq exons, and ESTs are all higher than expected. Still, more than half of the hotspots lie in noncoding regions of the mouse genome. We believe that the hotspots represent biologically interesting regions in the rodent genomes.",
        "doi": "10.1101/gr.1967904",
        "pmcid": "PMC383301",
        "issn": "1088-9051",
        "publisher": "Cold Spring Harbor Laboratory Press",
        "publication": "Genome Research",
        "publication_date": "2004-04",
        "series_number": "4",
        "volume": "14",
        "issue": "4",
        "pages": "574-579"
    },
    {
        "id": "authors:bzm2q-a6208",
        "collection": "authors",
        "collection_id": "bzm2q-a6208",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170308-142044905",
        "type": "article",
        "title": "Visualization of Multiple Genome Annotations and Alignments With the K-BROWSER",
        "author": [
            {
                "family_name": "Chakrabarti",
                "given_name": "Kushal",
                "clpid": "Chakrabarti-K"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "We introduce a novel genome browser application, the K-BROWSER, that allows intuitive visualization of biological information across an arbitrary number of multiply aligned genomes. In particular, the K-BROWSER simultaneously displays an arbitrary number of genomes both through overlaid annotations and predictions that describe their respective characteristics, and through the multiple alignment that describes their global relationship to one another. The browsing environment has been designed to allow users seamless access to information available in every genome and, furthermore, to allow easy navigation within and between genomes. As of the date of publication, the K-BROWSER has been set up on the human, mouse, and rat genomes.",
        "doi": "10.1101/gr.1957004",
        "pmcid": "PMC383318",
        "issn": "1088-9051",
        "publisher": "Cold Spring Harbor Laboratory Press",
        "publication": "Genome Research",
        "publication_date": "2004-04",
        "series_number": "4",
        "volume": "14",
        "issue": "4",
        "pages": "716-720"
    },
    {
        "id": "authors:nb32q-zt780",
        "collection": "authors",
        "collection_id": "nb32q-zt780",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170307-074220313",
        "type": "article",
        "title": "MAVID: Constrained ancestral alignment of multiple sequences",
        "author": [
            {
                "family_name": "Bray",
                "given_name": "Nicolas",
                "clpid": "Bray-N-L"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "We describe a new global multiple-alignment program capable of aligning a large number of genomic regions. Our progressive-alignment approach incorporates the following ideas: maximum-likelihood inference of ancestral sequences, automatic guide-tree construction, protein-based anchoring of ab-initio gene predictions, and constraints derived from a global homology map of the sequences. We have implemented these ideas in the MAVID program, which is able to accurately align multiple genomic regions up to megabases long. MAVID is able to effectively align divergent sequences, as well as incomplete unfinished sequences. We demonstrate the capabilities of the program on the benchmark CFTR region, which consists of 1.8 Mb of human sequence and 20 orthologous regions in marsupials, birds, fish, and mammals. Finally, we describe two large MAVID alignments, an alignment of all the available HIV genomes and a multiple alignment of the entire human, mouse, and rat genomes.",
        "doi": "10.1101/gr.1960404",
        "pmcid": "PMC383315",
        "issn": "1088-9051",
        "publisher": "Cold Spring Harbor Laboratory Press",
        "publication": "Genome Research",
        "publication_date": "2004-04",
        "series_number": "4",
        "volume": "14",
        "issue": "4",
        "pages": "693-699"
    },
    {
        "id": "authors:9kwva-gf330",
        "collection": "authors",
        "collection_id": "9kwva-gf330",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170308-144150791",
        "type": "article",
        "title": "Accurate Identification of Novel Human Genes Through Simultaneous Gene Prediction in Human, Mouse, and Rat",
        "author": [
            {
                "family_name": "Dewey",
                "given_name": "Colin",
                "clpid": "Dewey-C-N"
            },
            {
                "family_name": "Wu",
                "given_name": "Jia Qian",
                "clpid": "Wu-Jia-Qian"
            },
            {
                "family_name": "Cawley",
                "given_name": "Simon",
                "clpid": "Cawley-S"
            },
            {
                "family_name": "Alexandersson",
                "given_name": "Marina",
                "clpid": "Alexandersson-M"
            },
            {
                "family_name": "Gibbs",
                "given_name": "Richard",
                "clpid": "Gibbs-R-A"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "We describe a new method for simultaneously identifying novel homologous genes with identical structure in the human, mouse, and rat genomes by combining pairwise predictions made with the SLAM gene-finding program. Using this method, we found 3698 gene triples in the human, mouse, and rat genomes which are predicted with exactly the same gene structure. We show, both computationally and experimentally, that the introns of these triples are predicted accurately as compared with the introns of other ab initio gene prediction sets. Computationally, we compared the introns of these gene triples, as well as those from other ab initio gene finders, with known intron annotations. We show that a unique property of SLAM, namely that it predicts gene structures simultaneously in two organisms, is key to producing sets of predictions that are highly accurate in intron structure when combined with other programs. Experimentally, we performed reverse transcription-polymerase chain reaction (RT-PCR) in both the human and rat to test the exon pairs flanking introns from a subset of the gene triples for which the human gene had not been previously identified. By performing RT-PCR on orthologous introns in both the human and rat genomes, we additionally explore the validity of using RT-PCR as a method for confirming gene predictions.",
        "doi": "10.1101/gr.1939804",
        "pmcid": "PMC383310",
        "issn": "1088-9051",
        "publisher": "Cold Spring Harbor Laboratory Press",
        "publication": "Genome Research",
        "publication_date": "2004-04",
        "series_number": "4",
        "volume": "14",
        "issue": "4",
        "pages": "661-664"
    },
    {
        "id": "authors:dkg16-32f71",
        "collection": "authors",
        "collection_id": "dkg16-32f71",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170308-151105303",
        "type": "article",
        "title": "HMM sampling and applications to gene finding and alternative splicing",
        "author": [
            {
                "family_name": "Cawley",
                "given_name": "Simon L.",
                "clpid": "Cawley-S"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "The standard method of applying hidden Markov models to biological problems is to find a Viterbi (maximal weight) path through the HMM graph. The Viterbi algorithm reduces the problem of finding the most likely hidden state sequence that explains given observations, to a dynamic programming problem for corresponding directed acyclic graphs. For example, in the gene finding application, the HMM is used to find the most likely underlying gene structure given a DNA sequence. In this note we discuss the applications of sampling methods for HMMs. The standard sampling algorithm for HMMs is a variant of the common forward-backward and backtrack algorithms, and has already been applied in the context of Gibbs sampling methods. Nevetheless, the practice of sampling state paths from HMMs does not seem to have been widely adopted, and important applications have been overlooked. We show how sampling can be used for finding alternative splicings for genes, including alternative splicings that are conserved between genes from related organisms. We also show how sampling from the posterior distribution is a natural way to compute probabilities for predicted exons and gene structures being correct under the assumed model. Finally, we describe a new memory efficient sampling algorithm for certain classes of HMMs which provides a practical sampling alternative to the Hirschberg algorithm for optimal alignment. The ideas presented have applications not only to gene finding and HMMs but more generally to stochastic context free grammars and RNA structure prediction.",
        "doi": "10.1093/bioinformatics/btg1057",
        "issn": "1367-4803",
        "publisher": "Oxford University Press",
        "publication": "Bioinformatics",
        "publication_date": "2003-09-27",
        "series_number": "Suppl 2",
        "volume": "19",
        "issue": "Suppl 2",
        "pages": "ii36-ii41"
    },
    {
        "id": "authors:n5qhb-91b08",
        "collection": "authors",
        "collection_id": "n5qhb-91b08",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170308-151538288",
        "type": "article",
        "title": "Forcing numbers of stop signs",
        "author": [
            {
                "family_name": "Lam",
                "given_name": "Fumei",
                "clpid": "Lam-Fumei"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "Let G be a graph with a perfect matching M. The forcing number of M is the smallest number of edges in a subset S\u2282M such that S is contained in no other perfect matching of G. We present methods for determining bounds on forcing numbers and apply these methods to find bounds for the forcing numbers of stop signs. A consequence of our main result is that every perfect matching of a stop sign of size (n,k) contains at least n disjoint alternating cycles.",
        "doi": "10.1016/S0304-3975(02)00499-1",
        "issn": "0304-3975",
        "publisher": "Elsevier",
        "publication": "Theoretical Computer Science",
        "publication_date": "2003-07-15",
        "series_number": "2-3",
        "volume": "303",
        "issue": "2-3",
        "pages": "409-416"
    },
    {
        "id": "authors:07qjz-fdj71",
        "collection": "authors",
        "collection_id": "07qjz-fdj71",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170308-152103068",
        "type": "article",
        "title": "MAVID multiple alignment server",
        "author": [
            {
                "family_name": "Bray",
                "given_name": "Nicolas",
                "clpid": "Bray-N-L"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "MAVID is a multiple alignment program suitable for many large genomic regions. The MAVID web server allows biomedical researchers to quickly obtain multiple alignments for genomic sequences and to subsequently analyse the alignments for conserved regions. MAVID has been successfully used for the alignment of closely related species such as primates and also for the alignment of more distant organisms such as human and fugu. The server is fast, capable of aligning hundreds of kilobases in less than a minute. The multiple alignment is used to build a phylogenetic tree for the sequences, which is subsequently used as a basis for identifying conserved regions in the alignment. The server can be accessed at http://baboon.math.berkeley.edu/mavid/.",
        "doi": "10.1093/nar/gkg623",
        "pmcid": "PMC169029",
        "issn": "1362-4962",
        "publisher": "Oxford University Press",
        "publication": "Nucleic Acids Research",
        "publication_date": "2003-07-01",
        "series_number": "13",
        "volume": "31",
        "issue": "13",
        "pages": "3525-3526"
    },
    {
        "id": "authors:s4hgs-n2n71",
        "collection": "authors",
        "collection_id": "s4hgs-n2n71",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170308-152527629",
        "type": "article",
        "title": "SLAM web server for comparative gene finding and alignment",
        "author": [
            {
                "family_name": "Cawley",
                "given_name": "Simon",
                "clpid": "Cawley-S"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "family_name": "Alexandersson",
                "given_name": "Marina",
                "clpid": "Alexandersson-M"
            }
        ],
        "abstract": "SLAM is a program that simultaneously aligns and annotates pairs of homologous sequences. The SLAM web server integrates SLAM with repeat masking tools and the AVID alignment program to allow for rapid alignment and gene prediction in user submitted sequences. Along with annotations and alignments for the submitted sequences, users obtain a list of predicted conserved non-coding sequences (and their associated alignments). The web site also links to whole genome annotations of the human, mouse and rat genomes produced with the SLAM program. The server can be accessed at http://bio.math.berkeley.edu/slam.",
        "doi": "10.1093/nar/gkg583",
        "pmcid": "PMC168989",
        "issn": "1362-4962",
        "publisher": "Oxford University Press",
        "publication": "Nucleic Acids Research",
        "publication_date": "2003-07-01",
        "series_number": "13",
        "volume": "31",
        "issue": "13",
        "pages": "3507-3509"
    },
    {
        "id": "authors:azh9j-rwv53",
        "collection": "authors",
        "collection_id": "azh9j-rwv53",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170308-153301235",
        "type": "article",
        "title": "Picking Alignments from (Steiner) Trees",
        "author": [
            {
                "family_name": "Lam",
                "given_name": "Fumei",
                "clpid": "Lam-Fumei"
            },
            {
                "family_name": "Alexandersson",
                "given_name": "Marina",
                "clpid": "Alexandersson-M"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "The application of Needleman\u2013Wunsch alignment techniques to biological sequences is complicated by two serious problems when the sequences are long: the running time, which scales as the product of the lengths of sequences, and the difficulty in obtaining suitable parameters that produce meaningful alignments. The running time problem is often corrected by reducing the search space, using techniques such as banding, or chaining of high-scoring pairs. The parameter problem is more difficult to fix, partly because the probabilistic model, which Needleman\u2013Wunsch is equivalent to, does not capture a key feature of biological sequence alignments, namely the alternation of conserved blocks and seemingly unrelated nonconserved segments. We present a solution to the problem of designing efficient search spaces for pair hidden Markov models that align biological sequences by taking advantage of their associated features. Our approach leads to an optimization problem, for which we obtain a 2-approximation algorithm, and that is based on the construction of Manhattan networks, which are close relatives of Steiner trees. We describe the underlying theory and show how our methods can be applied to alignment of DNA sequences in practice, succesfully reducing the Viterbi algorithm search space of alignment PHMMs by three orders of magnitude.",
        "doi": "10.1089/10665270360688156",
        "issn": "1066-5277",
        "publisher": "Mary Ann Liebert, Inc.",
        "publication": "Journal of Computational Biology",
        "publication_date": "2003-07",
        "series_number": "3-4",
        "volume": "10",
        "issue": "3-4",
        "pages": "509-520"
    },
    {
        "id": "authors:ahkxa-cwz69",
        "collection": "authors",
        "collection_id": "ahkxa-cwz69",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170308-154151410",
        "type": "article",
        "title": "SLAM: Cross-Species Gene Finding and Alignment with a Generalized Pair Hidden Markov Model",
        "author": [
            {
                "family_name": "Alexandersson",
                "given_name": "Marina",
                "clpid": "Alexandersson-M"
            },
            {
                "family_name": "Cawley",
                "given_name": "Simon",
                "clpid": "Cawley-S"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "Comparative-based gene recognition is driven by the principle that conserved regions between related organisms are more likely than divergent regions to be coding. We describe a probabilistic framework for gene structure and alignment that can be used to simultaneously find both the gene structure and alignment of two syntenic genomic regions. A key feature of the method is the ability to enhance gene predictions by finding the best alignment between two syntenic sequences, while at the same time finding biologically meaningful alignments that preserve the correspondence between coding exons. Our probabilistic framework is the generalized pair hidden Markov model, a hybrid of (1) generalized hidden Markov models, which have been used previously for gene finding, and (2) pair hidden Markov models, which have applications to sequence alignment. We have built a gene finding and alignment program called SLAM, which aligns and identifies complete exon/intron structures of genes in two related but unannotated sequences of DNA. SLAM is able to reliably predict gene structures for any suitably related pair of organisms, most notably with fewer false-positive predictions compared to previous methods (examples are provided for Homo sapiens/Mus musculus andPlasmodium falciparum/Plasmodium vivax comparisons). Accuracy is obtained by distinguishing conserved noncoding sequence (CNS) from conserved coding sequence. CNS annotation is a novel feature of SLAM and may be useful for the annotation of UTRs, regulatory elements, and other noncoding features.",
        "doi": "10.1101/gr.424203",
        "pmcid": "PMC430255",
        "issn": "1088-9051",
        "publisher": "Cold Spring Harbor Laboratory Press",
        "publication": "Genome Research",
        "publication_date": "2003-03-01",
        "series_number": "3",
        "volume": "13",
        "issue": "3",
        "pages": "496-502"
    },
    {
        "id": "authors:hty8p-scq52",
        "collection": "authors",
        "collection_id": "hty8p-scq52",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170308-154624549",
        "type": "article",
        "title": "Phylogenetic Shadowing of Primate Sequences to Find Functional Regions of the Human Genome",
        "author": [
            {
                "family_name": "Boffelli",
                "given_name": "Dario",
                "clpid": "Boffelli-D"
            },
            {
                "family_name": "McAuliffe",
                "given_name": "Jon",
                "clpid": "McAuliffe-J-D"
            },
            {
                "family_name": "Ovcharenko",
                "given_name": "Dmitriy",
                "clpid": "Ovcharenko-D"
            },
            {
                "family_name": "Lewis",
                "given_name": "Keith D.",
                "clpid": "Lewis-K-D"
            },
            {
                "family_name": "Ovcharenko",
                "given_name": "Ivan",
                "clpid": "Ovcharenko-I"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "family_name": "Rubin",
                "given_name": "Edward M.",
                "clpid": "Rubin-E-M"
            }
        ],
        "abstract": "Nonhuman primates represent the most relevant model organisms to understand the biology of Homo sapiens. The recent divergence and associated overall sequence conservation between individual members of this taxon have nonetheless largely precluded the use of primates in comparative sequence studies. We used sequence comparisons of an extensive set of Old World and New World monkeys and hominoids to identify functional regions in the human genome. Analysis of these data enabled the discovery of primate-specific gene regulatory elements and the demarcation of the exons of multiple genes. Much of the information content of the comprehensive primate sequence comparisons could be captured with a small subset of phylogenetically close primates. These results demonstrate the utility of intraprimate sequence comparisons to discover common mammalian as well as primate-specific functional elements in the human genome, which are unattainable through the evaluation of more evolutionarily distant species.",
        "doi": "10.1126/science.1081331",
        "issn": "0036-8075",
        "publisher": "American Association for the Advancement of Science",
        "publication": "Science",
        "publication_date": "2003-02-28",
        "series_number": "5611",
        "volume": "299",
        "issue": "5611",
        "pages": "1391-1394"
    },
    {
        "id": "authors:ps1wq-rss05",
        "collection": "authors",
        "collection_id": "ps1wq-rss05",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170308-155631683",
        "type": "article",
        "title": "AVID: A Global Alignment Program",
        "author": [
            {
                "family_name": "Bray",
                "given_name": "Nick",
                "clpid": "Bray-N-L"
            },
            {
                "family_name": "Dubchak",
                "given_name": "Inna",
                "clpid": "Dubchak-I"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "In this paper we describe a new global alignment method called AVID. The method is designed to be fast, memory efficient, and practical for sequence alignments of large genomic regions up to megabases long. We present numerous applications of the method, ranging from the comparison of assemblies to alignment of large syntenic genomic regions and whole genome human/mouse alignments. We have also performed a quantitative comparison of AVID with other popular alignment tools. To this end, we have established a format for the representation of alignments and methods for their comparison. These formats and methods should be useful for future studies. The tools we have developed for the alignment comparisons, as well as the AVID program, are publicly available. See Web Site References section for AVID Web address and Web addresses for other programs discussed in this paper.",
        "doi": "10.1101/gr.789803",
        "pmcid": "PMC430967",
        "issn": "1088-9051",
        "publisher": "Cold Spring Harbor Laboratory Press",
        "publication": "Genome Research",
        "publication_date": "2003-01-01",
        "series_number": "1",
        "volume": "13",
        "issue": "1",
        "pages": "97-102"
    },
    {
        "id": "authors:2yrxj-wqw78",
        "collection": "authors",
        "collection_id": "2yrxj-wqw78",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170308-160145750",
        "type": "article",
        "title": "Strategies and Tools for Whole-Genome Alignments",
        "author": [
            {
                "family_name": "Couronne",
                "given_name": "Olivier",
                "clpid": "Couronne-O"
            },
            {
                "family_name": "Poliakov",
                "given_name": "Alexander",
                "clpid": "Poliakov-A-N-B"
            },
            {
                "family_name": "Bray",
                "given_name": "Nicolas",
                "clpid": "Bray-N-L"
            },
            {
                "family_name": "Ishkhanov",
                "given_name": "Tigran",
                "clpid": "Ishkhanov-T"
            },
            {
                "family_name": "Ryaboy",
                "given_name": "Dmitriy",
                "clpid": "Ryaboy-D"
            },
            {
                "family_name": "Rubin",
                "given_name": "Edward",
                "clpid": "Rubin-E-M"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "family_name": "Dubchak",
                "given_name": "Inna",
                "clpid": "Dubchak-I"
            }
        ],
        "abstract": "The availability of the assembled mouse genome makes possible, for the first time, an alignment and comparison of two large vertebrate genomes. We investigated different strategies of alignment for the subsequent analysis of conservation of genomes that are effective for assemblies of different quality. These strategies were applied to the comparison of the working draft of the human genome with the Mouse Genome Sequencing Consortium assembly, as well as other intermediate mouse assemblies. Our methods are fast and the resulting alignments exhibit a high degree of sensitivity, covering more than 90% of known coding exons in the human genome. We obtained such coverage while preserving specificity. With a view towards the end user, we developed a suite of tools and Web sites for automatically aligning and subsequently browsing and working with whole-genome comparisons. We describe the use of these tools to identify conserved non-coding regions between the human and mouse genomes, some of which have not been identified by other methods.",
        "doi": "10.1101/gr.762503",
        "pmcid": "PMC430965",
        "issn": "1088-9051",
        "publisher": "Cold Spring Harbor Laboratory Press",
        "publication": "Genome Research",
        "publication_date": "2003-01-01",
        "series_number": "1",
        "volume": "13",
        "issue": "1",
        "pages": "73-80"
    },
    {
        "id": "authors:0x565-8ph71",
        "collection": "authors",
        "collection_id": "0x565-8ph71",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170309-090859678",
        "type": "article",
        "title": "Initial sequencing and comparative analysis of the mouse genome",
        "author": [
            {
                "family_name": "Waterston",
                "given_name": "Robert H.",
                "clpid": "Waterston-R-H"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "literal": "Mouse Genome Sequencing Consortium"
            }
        ],
        "abstract": "The sequence of the mouse genome is a key informational tool for understanding the contents of the human genome and a key experimental tool for biomedical research. Here, we report the results of an international collaboration to produce a high-quality draft sequence of the mouse genome. We also present an initial comparative analysis of the mouse and human genomes, describing some of the insights that can be gleaned from the two sequences. We discuss topics including the analysis of the evolutionary forces shaping the size, structure and sequence of the genomes; the conservation of large-scale synteny across most of the genomes; the much lower extent of sequence orthology covering less than half of the genomes; the proportions of the genomes under selection; the number of protein-coding genes; the expansion of gene families related to reproduction and immunity; the evolution of proteins; and the identification of intraspecies polymorphism.",
        "doi": "10.1038/nature01262",
        "issn": "0028-0836",
        "publisher": "Nature Publishing Group",
        "publication": "Nature",
        "publication_date": "2002-12-05",
        "series_number": "6915",
        "volume": "420",
        "issue": "6915",
        "pages": "520-562"
    },
    {
        "id": "authors:nvtfr-fpx06",
        "collection": "authors",
        "collection_id": "nvtfr-fpx06",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170309-092737635",
        "type": "article",
        "title": "Applications of Generalized Pair Hidden Markov Models to Alignment and Gene Finding Problems",
        "author": [
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "family_name": "Alexandersson",
                "given_name": "Marina",
                "clpid": "Alexandersson-M"
            },
            {
                "family_name": "Cawley",
                "given_name": "Simon",
                "clpid": "Cawley-S"
            }
        ],
        "abstract": "Hidden Markov models (HMMs) have been successfully applied to a variety of problems in molecular biology, ranging from alignment problems to gene finding and annotation. Alignment problems can be solved with pair HMMs, while gene finding programs rely on generalized HMMs in order to model exon lengths. In this paper, we introduce the generalized pair HMM (GPHMM), which is an extension of both pair and generalized HMMs. We show how GPHMMs, in conjunction with approximate alignments, can be used for cross-species gene finding and describe applications to DNA\u2013cDNA and DNA\u2013protein alignment. GPHMMs provide a unifying and probabilistically sound theory for modeling these problems.",
        "doi": "10.1089/10665270252935520",
        "issn": "1066-5277",
        "publisher": "Mary Ann Liebert, Inc.",
        "publication": "Journal of Computational Biology",
        "publication_date": "2002-07",
        "series_number": "2",
        "volume": "9",
        "issue": "2",
        "pages": "389-399"
    },
    {
        "id": "authors:0947j-v2n07",
        "collection": "authors",
        "collection_id": "0947j-v2n07",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170309-093325277",
        "type": "article",
        "title": "rVista for Comparative Sequence-Based Discovery of Functional Transcription Factor Binding Sites",
        "author": [
            {
                "family_name": "Loots",
                "given_name": "Gabriela G.",
                "clpid": "Loots-G-G"
            },
            {
                "family_name": "Ovcharenko",
                "given_name": "Ivan",
                "clpid": "Ovcharenko-I"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "family_name": "Dubchak",
                "given_name": "Inna",
                "clpid": "Dubchak-I"
            },
            {
                "family_name": "Rubin",
                "given_name": "Edward M.",
                "clpid": "Rubin-E-M"
            }
        ],
        "abstract": "Identifying transcriptional regulatory elements represents a significant challenge in annotating the genomes of higher vertebrates. We have developed a computational tool, rVISTA, for high-throughput discovery of cis-regulatory elements that combines clustering of predicted transcription factor binding sites (TFBSs) and the analysis of interspecies sequence conservation to maximize the identification of functional sites. To assess the ability of rVISTA to discover true positive TFBSs while minimizing the prediction of false positives, we analyzed the distribution of several TFBSs across 1 Mb of the well-annotated cytokine gene cluster (Hs5q31; Mm11). Because a large number of AP-1, NFAT, and GATA-3 sites have been experimentally identified in this interval, we focused our analysis on the distribution of all binding sites specific for these transcription factors. The exploitation of the orthologous human\u2013mouse dataset resulted in the elimination of &gt;95% of the \u223c58,000 binding sites predicted on analysis of the human sequence alone, whereas it identified 88% of the experimentally verified binding sites in this region.",
        "doi": "10.1101/gr.225502",
        "pmcid": "PMC186580",
        "issn": "1088-9051",
        "publisher": "Cold Spring Harbor Laboratory Press",
        "publication": "Genome Research",
        "publication_date": "2002-05",
        "series_number": "5",
        "volume": "12",
        "issue": "5",
        "pages": "832-839"
    },
    {
        "id": "authors:mg9tz-23986",
        "collection": "authors",
        "collection_id": "mg9tz-23986",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170309-100309238",
        "type": "article",
        "title": "From First Base: The Sequence of the Tip of the X Chromosome of Drosophila melanogaster, a Comparison of Two Sequencing Strategies",
        "author": [
            {
                "family_name": "Benos",
                "given_name": "Panayiotis",
                "clpid": "Benos-P"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "We present the sequence of a contiguous 2.63 Mb of DNA extending from the tip of the X chromosome ofDrosophila melanogaster. Within this sequence, we predict 277 protein coding genes, of which 94 had been sequenced already in the course of studying the biology of their gene products, and examples of 12 different transposable elements. We show that an interval between bands 3A2 and 3C2, believed in the 1970s to show a correlation between the number of bands on the polytene chromosomes and the 20 genes identified by conventional genetics, is predicted to contain 45 genes from its DNA sequence. We have determined the insertion sites ofP-elements from 111 mutant lines, about half of which are in a position likely to affect the expression of novel predicted genes, thus representing a resource for subsequent functional genomic analysis. We compare the European Drosophila Genome Project sequence with the corresponding part of the independently assembled and annotated Joint Sequence determined through \"shotgun\" sequencing. Discounting differences in the distribution of known transposable elements between the strains sequenced in the two projects, we detected three major sequence differences, two of which are probably explained by errors in assembly; the origin of the third major difference is unclear. In addition there are eight sequence gaps within the Joint Sequence. At least six of these eight gaps are likely to be sites of transposable elements; the other two are complex. Of the 275 genes in common to both projects, 60% are identical within 1% of their predicted amino-acid sequence and 31% show minor differences such as in choice of translation initiation or termination codons; the remaining 9% show major differences in interpretation.",
        "doi": "10.1101/gr.173801",
        "pmcid": "PMC311117",
        "issn": "1088-9051",
        "publisher": "Cold Spring Harbor Laboratory Press",
        "publication": "Genome Research",
        "publication_date": "2002-05",
        "series_number": "5",
        "volume": "11",
        "issue": "5",
        "pages": "710-730"
    },
    {
        "id": "authors:3f346-d0z81",
        "collection": "authors",
        "collection_id": "3f346-d0z81",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170309-093904962",
        "type": "article",
        "title": "The computational challenges of applying comparative-based computational methods to whole genomes",
        "author": [
            {
                "family_name": "Dubchak",
                "given_name": "Inna",
                "clpid": "Dubchak-I"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "The explosion in genomic sequence avaliable in public databases has resulted in an unprecedented opportunity for computational whole genome analyses. A number of promising comparative-based approaches have been developed for gene finding, regulatory element discovery and other purposes, and it is clear that these tools will play a fundamental role in analysing the enormous amount of new data that is currently being generated. The synthesis of computationally intensive comparative computational approaches with the requirement for computational scientists. We focus on a few of these challenges, using by way of example the problems of alignment, gene and finding and regulatory element discovery, and discuss the issues that have arisen in attempts to solve these problems in the context of whole genome analysis pipelines.",
        "doi": "10.1093/bib/3.1.18",
        "issn": "1467-5463",
        "publisher": "Oxford University Press",
        "publication": "Briefings in Bioinformatics",
        "publication_date": "2002-03",
        "series_number": "1",
        "volume": "3",
        "issue": "1",
        "pages": "18-22"
    },
    {
        "id": "authors:hnyxk-9bm85",
        "collection": "authors",
        "collection_id": "hnyxk-9bm85",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170309-095230836",
        "type": "article",
        "title": "Mapping and identification of essential gene functions on the X chromosome of Drosophila",
        "author": [
            {
                "family_name": "Peter",
                "given_name": "Annette",
                "clpid": "Peter-A"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "The Drosophila melanogaster genome consists of four chromosomes that contain 165 Mb of DNA, 120 Mb of which are euchromatic. The two Drosophila Genome Projects, in collaboration with Celera Genomics Systems, have sequenced the genome, complementing the previously established physical and genetic maps. In addition, the Berkeley Drosophila Genome Project has undertaken large\u2010scale functional analysis based on mutagenesis by transposable P element insertions into autosomes. Here, we present a large\u2010scale P element insertion screen for vital gene functions and a BAC tiling map for the X chromosome. A collection of 501 X\u2010chromosomal P element insertion lines was used to map essential genes cytogenetically and to establish short sequence tags (STSs) linking the insertion sites to the genome. The distribution of the P element integration sites, the identified genes and transcription units as well as the expression patterns of the P\u2010element\u2010tagged enhancers is described and discussed.",
        "doi": "10.1093/embo-reports/kvf012",
        "pmcid": "PMC1083931",
        "issn": "1469-221X",
        "publisher": "European Molecular Biology Organization",
        "publication": "EMBO Reports",
        "publication_date": "2002-01",
        "series_number": "1",
        "volume": "3",
        "issue": "1",
        "pages": "34-38"
    },
    {
        "id": "authors:tjked-pcn03",
        "collection": "authors",
        "collection_id": "tjked-pcn03",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170309-104254761",
        "type": "article",
        "title": "VISTA : visualizing global DNA sequence alignments of arbitrary length",
        "author": [
            {
                "family_name": "Mayor",
                "given_name": "Chris",
                "clpid": "Mayor-C"
            },
            {
                "family_name": "Brudno",
                "given_name": "Michael",
                "clpid": "Brudno-M"
            },
            {
                "family_name": "Schwartz",
                "given_name": "Jody R.",
                "clpid": "Schwartz-J-R"
            },
            {
                "family_name": "Poliakov",
                "given_name": "Alexander",
                "clpid": "Poliakov-A-N-B"
            },
            {
                "family_name": "Rubin",
                "given_name": "Edward M.",
                "clpid": "Rubin-E-M"
            },
            {
                "family_name": "Frazer",
                "given_name": "Kelly A.",
                "clpid": "Frazier-K-A"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior S.",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "family_name": "Dubchak",
                "given_name": "Inna",
                "clpid": "Dubchak-I"
            }
        ],
        "abstract": "VISTA is a program for visualizing global DNA sequence alignments of arbitrary length. It has a clean output, allowing for easy identification of similarity, and is easily configurable, enabling the visualization of alignments of various lengths at different levels of resolution. It is currently available on the web, thus allowing for easy access by all researchers. \n\nAvailability: VISTA server is available on the web at http://www-gsd.lbl.gov/vista. The source code is available upon request.",
        "doi": "10.1093/bioinformatics/16.11.1046",
        "issn": "1367-4803",
        "publisher": "Oxford University Press",
        "publication": "Bioinformatics",
        "publication_date": "2000-11",
        "series_number": "11",
        "volume": "16",
        "issue": "11",
        "pages": "1046-1047"
    },
    {
        "id": "authors:yas84-r6j70",
        "collection": "authors",
        "collection_id": "yas84-r6j70",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170309-111029375",
        "type": "article",
        "title": "Active Conservation of Noncoding Sequences Revealed by Three-Way Species Comparisons",
        "author": [
            {
                "family_name": "Dubchak",
                "given_name": "Inna",
                "clpid": "Dubchak-I"
            },
            {
                "family_name": "Brudno",
                "given_name": "Michael",
                "clpid": "Brudno-M"
            },
            {
                "family_name": "Loots",
                "given_name": "Gabriela G.",
                "clpid": "Loots-G-G"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "family_name": "Mayor",
                "given_name": "Chris",
                "clpid": "Mayor-C"
            },
            {
                "family_name": "Rubin",
                "given_name": "Edward M.",
                "clpid": "Rubin-E-M"
            },
            {
                "family_name": "Frazer",
                "given_name": "Kelly A.",
                "clpid": "Frazier-K-A"
            }
        ],
        "abstract": "Human and mouse genomic sequence comparisons are being increasingly used to search for evolutionarily conserved gene regulatory elements. Large-scale human\u2013mouse DNA comparison studies have discovered numerous conserved noncoding sequences of which only a fraction has been functionally investigated A question therefore remains as to whether most of these noncoding sequences are conserved because of functional constraints or are the result of a lack of divergence time. \n\n[The sequence data described in this paper have been submitted to the GenBank data library under accession nos. AF276990.]",
        "doi": "10.1101/gr.142200",
        "pmcid": "PMC310906",
        "issn": "1088-9051",
        "publisher": "Cold Spring Harbor Laboratory Press",
        "publication": "Genome Research",
        "publication_date": "2000-09",
        "series_number": "9",
        "volume": "10",
        "issue": "9",
        "pages": "1304-1306"
    },
    {
        "id": "authors:9zeww-dy998",
        "collection": "authors",
        "collection_id": "9zeww-dy998",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170309-110441139",
        "type": "article",
        "title": "Human and Mouse Gene Structure: Comparative Analysis and Application to Exon Prediction",
        "author": [
            {
                "family_name": "Batzoglou",
                "given_name": "Serafim",
                "clpid": "Batzoglou-S"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "family_name": "Mesirov",
                "given_name": "Jill P.",
                "clpid": "Mesirov-J-P"
            },
            {
                "family_name": "Berger",
                "given_name": "Bonnie",
                "clpid": "Berger-B"
            },
            {
                "family_name": "Lander",
                "given_name": "Eric S.",
                "orcid": "0000-0003-2662-4631",
                "clpid": "Lander-E-S"
            }
        ],
        "abstract": "We describe a novel analytical approach to gene recognition based on cross-species comparison. We first undertook a comparison of orthologous genomic loci from human and mouse, studying the extent of similarity in the number, size and sequence of exons and introns. We then developed an approach for recognizing genes within such orthologous regions by first aligning the regions using an iterative global alignment system and then identifying genes based on conservation of exonic features at aligned positions in both species. The alignment and gene recognition are performed by new programs calledGLASS and ROSETTA, respectively.ROSETTA performed well at exact identification of coding exons in 117 orthologous pairs tested.",
        "doi": "10.1101/gr.10.7.950",
        "pmcid": "PMC310911",
        "issn": "1088-9051",
        "publisher": "Cold Spring Harbor Laboratory Press",
        "publication": "Genome Research",
        "publication_date": "2000-07",
        "series_number": "7",
        "volume": "10",
        "issue": "7",
        "pages": "950-958"
    },
    {
        "id": "authors:vh6dd-k6133",
        "collection": "authors",
        "collection_id": "vh6dd-k6133",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170309-113000311",
        "type": "article",
        "title": "A Dictionary-Based Approach for Gene Annotation",
        "author": [
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "family_name": "Batzoglou",
                "given_name": "Serafim",
                "clpid": "Batzoglou-S"
            },
            {
                "family_name": "Spitkovsky",
                "given_name": "Valentin I.",
                "clpid": "Spitkovsky-V-I"
            },
            {
                "family_name": "Banks",
                "given_name": "Eric",
                "clpid": "Banks-E"
            },
            {
                "family_name": "Lander",
                "given_name": "Eric S.",
                "orcid": "0000-0003-2662-4631",
                "clpid": "Lander-E-S"
            },
            {
                "family_name": "Kleitman",
                "given_name": "Daniel J.",
                "clpid": "Kleitman-D-J"
            },
            {
                "family_name": "Berger",
                "given_name": "Bonnie",
                "clpid": "Berger-B"
            }
        ],
        "abstract": "This paper describes a fast and fully automated dictionary-based approach to gene annotation and exon prediction. Two dictionaries are constructed, one from the nonredundant protein OWL database and the other from the dbEST database. These dictionaries are used to obtain O(1) time lookups of tuples in the dictionaries (4 tuples for the OWL database and 11 tuples for the dbEST database). These tuples can be used to rapidly find the longest matches at every position in an input sequence to the database sequences. Such matches provide very useful information pertaining to locating common segments between exons, alternative splice sites, and frequency data of long tuples for statistical purposes. These dictionaries also provide the basis for both homology determination, and statistical approaches to exon prediction.",
        "doi": "10.1089/106652799318364",
        "issn": "1066-5277",
        "publisher": "Mary Ann Liebert, Inc.",
        "publication": "Journal of Computational Biology",
        "publication_date": "1999-07",
        "series_number": "3-4",
        "volume": "6",
        "issue": "3-4",
        "pages": "419-430"
    },
    {
        "id": "authors:8cx0q-kp250",
        "collection": "authors",
        "collection_id": "8cx0q-kp250",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170309-141622723",
        "type": "article",
        "title": "Forcing matchings on square grids",
        "author": [
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            },
            {
                "family_name": "Kim",
                "given_name": "Peter",
                "clpid": "Kim-Peter"
            }
        ],
        "abstract": "Let G be a graph that admits a perfect matching. The forcing number of a perfect matching M of G is defined as the smallest number of edges in a subset S \u2282 M, such that S is in no other perfect matching. We show that for the 2n \u00d7 2n square grid, the forcing number of any perfect matching is bounded below by n and above by n^2. Both bounds are sharp. We also establish a connection between the forcing problem and the minimum feedback set problem. Finally, we present some conjectures about forcing numbers in other graphs.",
        "doi": "10.1016/S0012-365X(97)00266-5",
        "issn": "0012-365X",
        "publisher": "Elsevier",
        "publication": "Discrete Mathematics",
        "publication_date": "1998-08-28",
        "series_number": "1-3",
        "volume": "190",
        "issue": "1-3",
        "pages": "287-294"
    },
    {
        "id": "authors:xbfv1-a5n80",
        "collection": "authors",
        "collection_id": "xbfv1-a5n80",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170309-114305555",
        "type": "article",
        "title": "Finding Convex Sets Among Points in the Plane",
        "author": [
            {
                "family_name": "Kleitman",
                "given_name": "D.",
                "clpid": "Kleitman-D-J"
            },
            {
                "family_name": "Pachter",
                "given_name": "L.",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "Let g(n) denote the least value such that any g(n) points in the plane in general position contain the vertices of a convex n-gon. In 1935, Erd\u0151s and Szekeres showed that g(n) exists, and they obtained the bounds 2^(n\u22122) + 1 \u2264 g(n) \u2264 (^(2n\u22124)_(n\u22122)) + 1. Chung and Graham have recently improved the upper bound by 1; the first improvement since the original Erd\u0151s\u2014Szekeres paper. We show that g(n) \u2264 (^(2n\u22124)_(n\u22122)) + 7 \u2212 2n.",
        "doi": "10.1007/PL00009358",
        "issn": "0179-5376",
        "publisher": "Springer",
        "publication": "Discrete and Computational Geometry",
        "publication_date": "1998-03",
        "series_number": "3",
        "volume": "19",
        "issue": "3",
        "pages": "405-410"
    },
    {
        "id": "authors:jxxfj-d1y91",
        "collection": "authors",
        "collection_id": "jxxfj-d1y91",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170309-142440565",
        "type": "article",
        "title": "Recent Developments in Computational Gene Recognition",
        "author": [
            {
                "family_name": "Batzoglou",
                "given_name": "Serafim",
                "clpid": "Batzoglou-S"
            },
            {
                "family_name": "Berger",
                "given_name": "Bonnie",
                "clpid": "Berger-B"
            },
            {
                "family_name": "Kleitman",
                "given_name": "Daniel J.",
                "clpid": "Kleitman-D-J"
            },
            {
                "family_name": "Lander",
                "given_name": "Eric S.",
                "orcid": "0000-0003-2662-4631",
                "clpid": "Lander-E-S"
            },
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "We survey recent mathematical and computational work in the field of gene recognition, focusing on the techniques that have been developed to tackle the problem of identifying protein coding regions in genes. We also present a new approach to gene recognition which is based on a variety of tools we have developed.",
        "issn": "1431-0635",
        "publisher": "Deutsche Mathematiker-Vereinigung (DMV)",
        "publication": "Documenta Mathematica",
        "publication_date": "1998",
        "volume": "ICM I",
        "pages": "649-658"
    },
    {
        "id": "authors:ccpk8-h2947",
        "collection": "authors",
        "collection_id": "ccpk8-h2947",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170309-143338663",
        "type": "article",
        "title": "Constructing status injective graphs",
        "author": [
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "The status, or distance sum, of a given vertex v in a graph is defined by s(v) = \u2211_(u \u2260 v)d(u, v) where d(u, v) is the distance from a vertex u to v. We show that every graph is the induced subgraph of a graph whose vertices all have distinct stati. Using this result we then construct a family of graphs which have consecutive integers for their stati. This settles the question raised by Harary and Buckley about whether there exist graphs whose stati are consecutive integers. We also use the above constructions to find families of non-isomorphic graphs with the same stati.",
        "doi": "10.1016/S0166-218X(97)00073-5",
        "issn": "0166-218X",
        "publisher": "Elsevier",
        "publication": "Discrete Applied Mathematics",
        "publication_date": "1997-12-05",
        "series_number": "1",
        "volume": "80",
        "issue": "1",
        "pages": "107-113"
    },
    {
        "id": "authors:mxdck-d5b36",
        "collection": "authors",
        "collection_id": "mxdck-d5b36",
        "cite_using_url": "https://resolver.caltech.edu/CaltechAUTHORS:20170309-144854496",
        "type": "article",
        "title": "Combinatorial Approaches and Conjectures for 2-Divisibility Problems Concerning Domino Tilings of Polyominoes",
        "author": [
            {
                "family_name": "Pachter",
                "given_name": "Lior",
                "orcid": "0000-0002-9164-6231",
                "clpid": "Pachter-L"
            }
        ],
        "abstract": "We give the first complete combinatorial proof of the fact that the number of domino tilings of the 2n\u00d72n square grid is of the form 2^n(2k + 1)^2, thus settling a question raised by John, Sachs, and Zernitz. The proof lends itself naturally to some interesting generalizations, and leads to a number of new conjectures.",
        "issn": "1077-8926",
        "publisher": "Electronic Journal of Combinatorics",
        "publication": "Electronic Journal of Combinatorics",
        "publication_date": "1997-11-08",
        "series_number": "1",
        "volume": "4",
        "issue": "1",
        "pages": "Art. No. R29"
    }
]