Unveiling Hidden Patterns: Novel Computational Techniques for Protein Motif and Motif-Binding Pocket Identification

Embargo End Date

2025-04-17

Authors

Ali, H

Document Type

Thesis or Dissertation

Date

2024-10-17

Abstract

Protein-protein interaction interfaces are crucial to cellular function. Protein motif-mediated interactions play key regulatory roles in the cell and are under-explored yet high-potential therapeutic targets. This thesis therefore addresses the two components of protein motif-mediated interfaces: the motif and the binding pocket. The aim of the study is to discover both sides of the interface computationally, through three major contributions.

In the first contribution, I focus on the motif-binding pocket side of the interaction by introducing xProtCAS, a tool to pinpoint functional regions on protein surfaces. xProtCAS takes advantage of the structure models derived from the recent revolution in deep-learning protein structure prediction. The tool identifies solvent-accessible surface areas using a geometric algorithm and builds a graph whose nodes are the surface-accessible residues, with edges that connect residues proximal in 3D space and are weighted by conservation scores. xProtCAS then uses graph algorithms to score the influence of each residue in the graph, which reflects its functionality, and extracts high-scoring residue clusters as putative functional regions. Finally, I apply xProtCAS to the human proteome, discovering thousands of uncharacterised, highly conserved protein surfaces. These regions are ranked using a statistical model that quantifies their conservation relative to the surrounding protein surface. The dataset and the tool (open-source standalone software and a web server) are publicly available.

In the second contribution, I shift my focus to the motif side of the interaction by presenting FaSTPACE, a fast and scalable algorithm for aligning peptides and extracting motif consensuses from large peptide datasets. Such datasets have become available through recent advances in high-throughput techniques for discovering motif-mediated interactions, which produce functionally related peptides and necessitate novel tools, such as FaSTPACE, to process the data. I extensively validated the performance of FaSTPACE on artificially generated data and on experimental data from ProP-PD, a high-throughput experiment for motif discovery that produces datasets of hundreds of putative motif-containing peptides. The tool shows accuracy and speed comparable to existing tools. Moreover, it is publicly available as an open-source programming library and as a web server.

In the third contribution, I develop methodologies to integrate distinct peptide attributes into a single peptide confidence score for binder/non-binder classification and for prioritising peptides for further validation. I provide a machine learning-based scoring scheme that combines experimental attributes, motif-matching scores, and biological peptide features. The scheme employs a targeted dimensionality reduction algorithm to produce a single score from the distinct and discriminatory input peptide features. These scores can define confidence levels that indicate peptide functionality and support peptide ranking. The scoring scheme is applied to a large ProP-PD dataset to define a set of peptides with high confidence of containing a functional motif.

These contributions facilitate the study of existing motif-mediated interactions and the discovery of novel instances. In particular, xProtCAS and FaSTPACE provide speed and scalability that did not exist in previous tools. This improved computational performance has permitted the large datasets that have become available in recent years to be processed rapidly and accurately to derive novel biological insights. Additionally, the proposed machine learning scoring scheme for ProP-PD peptides enhances the processing of peptide screen data, improving the identification of biologically relevant motifs. Finally, I discuss putative future directions, particularly the potential of integrating the introduced tools with recent advances in deep learning and protein language models.
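
The surface-graph idea described for xProtCAS can be illustrated with a minimal sketch. It is not the xProtCAS implementation: the abstract does not name the graph algorithm, so a personalised PageRank-style power iteration is assumed here as a stand-in for "influence", and the 8 Å distance cutoff, the mean-conservation edge weight, the above-average clustering rule, the function names, and the toy coordinates are all hypothetical choices made for illustration.

```python
import math

def build_surface_graph(coords, conservation, cutoff=8.0):
    """Connect residues within `cutoff` (assumed value) of each other.

    Edge weight = mean conservation of the two endpoints (an assumption;
    the abstract only says edges are weighted with conservation scores).
    """
    n = len(coords)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                w = (conservation[i] + conservation[j]) / 2.0
                weights[i][j] = weights[j][i] = w
    return weights

def influence_scores(weights, conservation, damping=0.85, iterations=100):
    """Personalised PageRank by power iteration (stand-in influence measure).

    The restart distribution is proportional to conservation, so conserved
    patches accumulate score even when the graph is disconnected.
    """
    n = len(weights)
    strength = [sum(row) for row in weights]
    total_cons = sum(conservation)
    restart = [c / total_cons for c in conservation]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new = []
        for i in range(n):
            incoming = sum(
                scores[j] * weights[j][i] / strength[j]
                for j in range(n)
                if strength[j] > 0
            )
            new.append((1 - damping) * restart[i] + damping * incoming)
        total = sum(new)
        scores = [s / total for s in new]  # renormalise so scores sum to 1
    return scores

def functional_clusters(weights, scores):
    """Group residues scoring above the mean into connected clusters."""
    mean = sum(scores) / len(scores)
    selected = {i for i, s in enumerate(scores) if s > mean}
    clusters, seen = [], set()
    for start in sorted(selected):
        if start in seen:
            continue
        stack, cluster = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            cluster.append(node)
            for nb in selected:
                if nb not in seen and weights[node][nb] > 0:
                    seen.add(nb)
                    stack.append(nb)
        clusters.append(sorted(cluster))
    return clusters

# Toy input (hypothetical): three conserved residues close together,
# two variable residues far away from them.
coords = [(0, 0, 0), (3, 0, 0), (0, 3, 0), (20, 0, 0), (23, 0, 0)]
conservation = [0.9, 0.8, 0.9, 0.2, 0.1]

weights = build_surface_graph(coords, conservation)
scores = influence_scores(weights, conservation)
clusters = functional_clusters(weights, scores)
print(clusters)
```

In this toy case the conserved patch is returned as a single cluster, while the two variable residues fall below the average score. A real analysis would start from solvent-accessible residues of a predicted structure and per-residue conservation derived from an alignment, neither of which is modelled here.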

Citation

2024

Publisher

Institute of Cancer Research (University Of London)

Research Team

Short Linear Motif