[
    {
        "id": "authors:3h7zz-szy25",
        "collection": "authors",
        "collection_id": "3h7zz-szy25",
        "cite_using_url": "https://authors.library.caltech.edu/records/3h7zz-szy25",
        "type": "monograph",
        "title": "Discovery of a phenazine\u2013thiol conjugase from sparse data using genome-informed machine learning",
        "author": [
            {
                "family_name": "Shan",
                "given_name": "Xiaoyu",
                "orcid": "0000-0001-9631-3244",
                "clpid": "Shan-Xiaoyu"
            },
            {
                "family_name": "Trindade",
                "given_name": "In\u00eas B.",
                "orcid": "0000-0002-6746-8455",
                "clpid": "Trindade-Ines-B"
            },
            {
                "family_name": "Glasser",
                "given_name": "Nathaniel R.",
                "orcid": "0000-0002-2833-5166",
                "clpid": "Glasser-Nathaniel-Robert"
            },
            {
                "family_name": "Thalhammer",
                "given_name": "Korbinian O.",
                "orcid": "0000-0001-6882-8611",
                "clpid": "Thalhammer-Korbinian-O"
            },
            {
                "family_name": "Scurria",
                "given_name": "Matthew",
                "orcid": "0009-0001-0598-2133",
                "clpid": "Scurria-Matthew"
            },
            {
                "family_name": "Mora",
                "given_name": "Ariane",
                "orcid": "0000-0003-1331-8192"
            },
            {
                "family_name": "Conway",
                "given_name": "Stuart J.",
                "orcid": "0000-0002-5148-117X"
            },
            {
                "family_name": "Newman",
                "given_name": "Dianne K.",
                "orcid": "0000-0003-1647-1918",
                "clpid": "Newman-D-K"
            }
        ],
        "abstract": "<p>Machine learning has enabled powerful biological discoveries using models trained on large datasets. However, for many important biological questions, such as identifying enzymes that transform understudied substrates, sparsity of training data is often a major bottleneck. Here, using phenazine natural products as a case study, we show that integrating genome-informed data augmentation with contrastive learning in protein language space enables identification of phenazine-interacting proteins starting from only 14 known phenazine modifying sequences. Applying this framework led to the discovery of PTC (Phenazine-Thiol Conjugase), the first enzyme known to catalyze phenazine thioconjugation, a phenazine modification reaction long observed but previously presumed to occur only through non-enzymatic chemistry. In silico simulation and experimental measurements demonstrate that PTC binds to both phenazine and glutathione as substrates. Recombinant expression and biochemical characterization reveal that PTC promotes glutathione-dependent modification of phenazines, yielding distinct reaction outcomes that depend on substrate identity. Although thiol-conjugated phenazine products exhibit reduced toxicity to bacterial cells, deletion of the gene encoding PTC does not confer a strong fitness disadvantage, illustrating how direct learning of sequences can uncover relevant enzymes that might evade phenotype-based genetic screens. Together, these results demonstrate that coupling comparative genomics with protein machine learning can convert &ldquo;small data&rdquo; typically outside the scope of machine learning into actionable predictive power, thereby facilitating enzyme discovery.</p>\n<div class=\"subsection\">\n<p><strong>Significance</strong> Machine learning excels when large, well-labeled datasets are available, yet many biologically important problems lack sufficient experimental data to support such approaches to discovery. This limitation is particularly acute for identifying enzymes acting on rare or understudied substrates. Here, we show that genomic organization can be leveraged as an additional source of biological information to address data sparsity. Starting with only 14 enzymes experimentally shown to modify phenazines, we developed a model identifying phenazine-interacting enzymes by integrating genome-informed data augmentation with protein machine learning. Guided by the model, we discovered the first enzyme known to catalyze thioconjugation modifications of phenazines, demonstrating a simple yet powerful strategy for extracting predictive insight from sparse biological knowledge.</p>\n</div>",
        "doi": "10.64898/2026.03.05.709892",
        "publisher": "bioRxiv",
        "publication_date": "2026-03-06"
    }
]