Unveiling Hidden Patterns: Novel Computational Techniques for Protein Motif and Motif-Binding Pocket Identification
Embargo End Date
2025-04-17
Authors
Ali, H
Document Type
Thesis or Dissertation
Date
2024-10-17
Abstract
Protein-protein interaction interfaces are crucial to cellular function. Protein motif-mediated
interactions play key regulatory roles in the cell and are under-explored yet high-potential therapeutic
targets. Therefore, this thesis addresses the two components of protein motif-mediated interfaces: the
motif and the binding pocket. The aim of this study is to discover both sides of the interface
computationally by introducing three major contributions.
In the first contribution, I focus on the motif-binding pocket side of the interaction by
introducing xProtCAS, a tool to pinpoint functional regions on protein surfaces. xProtCAS takes
advantage of the structure models derived from the recent revolution in deep-learning protein
structure prediction. The tool identifies solvent-accessible surface areas using a geometric algorithm
and builds a graph from surface-accessible residues where those residues are graph nodes. The graph
connects proximal residues in the 3D space with edges weighted with conservation scores. Then,
xProtCAS uses graph algorithms to score the influence of each residue in the graph, which reflects
its functional importance, and high-scoring residue clusters are extracted as putative functional regions.
Finally, I apply xProtCAS to the human proteome, discovering thousands of uncharacterised
highly-conserved protein surfaces. These regions are ranked based on a statistical model quantifying
their conservation compared to the surrounding protein surface. The dataset and the tool (an
open-source standalone software and a web server) are made available for public use.
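
The graph construction and residue scoring described above can be sketched as follows. This is a minimal illustration and not the xProtCAS implementation: the abstract does not name the graph algorithm, so weighted PageRank is used here as a stand-in, and the 6 Å distance cutoff, the edge weight (mean conservation of the two residues), the residue labels, and the toy coordinates are all assumptions made for the example.

```python
import math

def build_surface_graph(coords, conservation, cutoff=6.0):
    """Link surface residues closer than `cutoff`; weight = mean conservation (assumed)."""
    graph = {res: {} for res in coords}
    residues = sorted(coords)
    for i, a in enumerate(residues):
        for b in residues[i + 1:]:
            if math.dist(coords[a], coords[b]) <= cutoff:
                weight = (conservation[a] + conservation[b]) / 2.0
                graph[a][b] = weight
                graph[b][a] = weight
    return graph

def weighted_pagerank(graph, damping=0.85, iterations=100):
    """Power-iteration PageRank on an undirected weighted graph (stand-in scoring)."""
    n = len(graph)
    scores = {res: 1.0 / n for res in graph}
    strength = {res: sum(nbrs.values()) for res, nbrs in graph.items()}
    for _ in range(iterations):
        # Residues with no weighted edges spread their score uniformly.
        dangling = sum(scores[res] for res in graph if strength[res] == 0)
        updated = {}
        for res in graph:
            incoming = sum(
                scores[nb] * w / strength[nb]
                for nb, w in graph[res].items()
                if strength[nb] > 0
            )
            updated[res] = (1 - damping) / n + damping * (incoming + dangling / n)
        scores = updated
    return scores

# Hypothetical toy surface: five residues on a line, conserved in the middle.
coords = {
    "r1": (0.0, 0.0, 0.0),
    "r2": (5.0, 0.0, 0.0),
    "r3": (10.0, 0.0, 0.0),
    "r4": (15.0, 0.0, 0.0),
    "r5": (20.0, 0.0, 0.0),
}
conservation = {"r1": 0.1, "r2": 0.9, "r3": 0.9, "r4": 0.9, "r5": 0.1}

graph = build_surface_graph(coords, conservation)
scores = weighted_pagerank(graph)

# Residues scoring above the uniform baseline form the putative functional patch.
patch = sorted(res for res, s in scores.items() if s > 1.0 / len(scores))
print(patch)
```

In this toy case the conserved central residues receive the highest scores, which is the behaviour the abstract describes for conserved surface clusters. A real pipeline would take coordinates from a predicted structure and conservation scores from an alignment of homologues.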
In the second contribution, I shift my focus to the motif side of the interaction by presenting
FaSTPACE, a fast and scalable algorithm for aligning peptides and extracting motif consensuses from
large datasets of peptides. Large peptide datasets have become available due to the recent advances in
high-throughput techniques for discovering motif-mediated interactions that produce functionally
related peptides, necessitating novel tools, such as FaSTPACE, to process these data. I extensively
validated the performance of FaSTPACE on artificially generated data and experimental data from
ProP-PD, a high-throughput experiment for motif discovery that produces datasets of hundreds of
putative motif-containing peptides. The tool shows accuracy and speed comparable to existing tools.
Moreover, it is publicly available as an open-source programming library and as a web server.
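
To make the idea of a motif consensus concrete, the sketch below derives one from peptides that are already aligned. It is not the FaSTPACE algorithm, whose alignment procedure is not described in this abstract; the peptides, the 0.6 frequency threshold, and the use of `x` for wildcard positions are assumptions chosen for illustration.

```python
from collections import Counter

def consensus(aligned_peptides, min_fraction=0.6, gap="-"):
    """Column-wise consensus: keep a residue if frequent enough, else 'x'."""
    length = len(aligned_peptides[0])
    if any(len(p) != length for p in aligned_peptides):
        raise ValueError("all aligned peptides must have the same length")
    symbols = []
    for column in zip(*aligned_peptides):
        counts = Counter(res for res in column if res != gap)
        if not counts:
            symbols.append("x")
            continue
        residue, count = counts.most_common(1)[0]
        symbols.append(residue if count / len(column) >= min_fraction else "x")
    # Trim uninformative wildcard positions from both ends.
    return "".join(symbols).strip("x")

# Hypothetical pre-aligned peptides sharing a P.LP.K-like pattern.
aligned = [
    "ASPTLPEK",
    "-GPALPKK",
    "DDPSLPRK",
    "--PELPAK",
    "TQPGLPSK",
]

motif = consensus(aligned)
print(motif)
```

The hard part, which this sketch skips, is producing the alignment itself for hundreds of peptides at once; that is the problem the tool described above addresses.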
In the third contribution, I develop methodologies to integrate distinct peptide attributes into
single peptide confidence scores for binder/non-binder classification and peptide prioritisation for
further validation. I provide a machine learning-based scoring scheme to combine experimental
attributes, motif-matching scores, and biological peptide features. The scheme employs a targeted
dimensionality reduction algorithm to produce a single score from the distinct and discriminatory
input peptide features. These scores can determine confidence levels indicating peptide functionality
and supporting peptide ranking. This scoring scheme is applied to a large ProP-PD dataset to define a
set of peptides with high confidence of containing a functional motif.
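
One way to collapse several peptide features into a single discriminative score is a supervised projection. The sketch below uses Fisher's linear discriminant as a stand-in, since the abstract does not name the dimensionality reduction algorithm; the three feature columns, the toy training values, and the small ridge term are all assumptions.

```python
def mean_vector(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def scatter(rows, mu):
    """Sum of outer products of deviations from the class mean."""
    d = len(mu)
    s = [[0.0] * d for _ in range(d)]
    for row in rows:
        diff = [x - m for x, m in zip(row, mu)]
        for i in range(d):
            for j in range(d):
                s[i][j] += diff[i] * diff[j]
    return s

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fisher_direction(binders, non_binders, ridge=1e-6):
    """Direction w = Sw^-1 (mu_binders - mu_non_binders); ridge keeps Sw invertible."""
    mu_b, mu_n = mean_vector(binders), mean_vector(non_binders)
    sb, sn = scatter(binders, mu_b), scatter(non_binders, mu_n)
    d = len(mu_b)
    within = [
        [sb[i][j] + sn[i][j] + (ridge if i == j else 0.0) for j in range(d)]
        for i in range(d)
    ]
    return solve(within, [b - n for b, n in zip(mu_b, mu_n)])

def score(features, w):
    return sum(x * wi for x, wi in zip(features, w))

# Hypothetical features per peptide: (experimental signal, motif match, biological prior).
binders = [[0.9, 0.8, 0.7], [0.8, 0.9, 0.6], [0.7, 0.7, 0.9], [0.95, 0.6, 0.8]]
non_binders = [[0.2, 0.3, 0.4], [0.1, 0.4, 0.2], [0.3, 0.1, 0.3], [0.25, 0.2, 0.1]]

w = fisher_direction(binders, non_binders)
binder_scores = [score(p, w) for p in binders]
non_binder_scores = [score(p, w) for p in non_binders]

# Rank all peptides by their single combined score, highest first.
ranked = sorted(binders + non_binders, key=lambda p: score(p, w), reverse=True)
```

The projection guarantees that the mean binder score exceeds the mean non-binder score; turning scores into confidence levels would additionally need a calibration step against validated peptides, which is not shown here.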
These contributions facilitate the study of existing motif-mediated interactions and the
discovery of novel instances. In particular, xProtCAS and FaSTPACE provide speed and scalability
that did not exist in previous tools. This improved computational performance has allowed the
large datasets that have become available in recent years to be processed rapidly and accurately,
yielding novel biological insights. Additionally, the proposed machine learning scoring scheme for
ProP-PD peptides enhances the processing of peptide screen data, improving the identification of
biologically relevant motifs.
Finally, I discuss putative future directions, particularly the potential of integrating the
introduced tools with recent advances in deep learning and protein language models.
Citation
2024
Publisher
Institute of Cancer Research (University Of London)
Collections
Research Team
Short Linear Motif
