TRAFICA: An Open Chromatin Language Model to Improve Transcription Factor Binding Affinity Prediction
bioRxiv – November 02, 2023
Source: medRxiv/bioRxiv/arXiv
Summary
By focusing on open chromatin regions, a new model improves the prediction of how transcription factors bind to DNA. Pre-trained on sequences from in vivo ATAC-seq experiments and fine-tuned on in vitro binding profiles, it outperformed existing tools, which suggests that in vivo chromatin context matters for understanding gene regulation.
Abstract
In silico prediction of transcription factor-DNA (TF-DNA) binding affinity plays a vital role in examining TF binding preferences and understanding gene regulation. Existing tools use TF-DNA binding profiles from in vitro high-throughput technologies to predict TF-DNA binding affinity. However, TFs tend to bind to sequences in open chromatin regions in vivo, and this binding preference is seldom considered by existing tools. In this study, we developed TRAFICA, an open chromatin language model that predicts TF-DNA binding affinity by integrating the characteristics of sequences from open chromatin regions in ATAC-seq experiments with in vitro TF-DNA binding profiles from high-throughput technologies. We applied self-supervised learning to pre-train TRAFICA on over 13 million nucleotide sequences from ATAC-seq peaks so that it learns TF binding preferences in vivo. TRAFICA was then fine-tuned on TF-DNA binding profiles from PBM and HT-SELEX technologies to learn the association between TFs and their target DNA sequences. We observed that TRAFICA significantly outperformed both machine learning-based and deep learning-based tools in predicting in vitro and in vivo TF-DNA binding affinity. These findings indicate that considering the characteristics of sequences from open chromatin regions can significantly improve TF-DNA binding affinity prediction, particularly when only limited TF-DNA binding profiles from high-throughput technologies are available for a specific TF.
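The abstract does not say how sequences are tokenized or which self-supervised objective is used, so the following is only a generic sketch of how masked-token training examples could be built from a DNA sequence. It is not TRAFICA's actual pipeline. The overlapping k-mer tokenizer, the value k=4, the 15% masking rate, and the helper names `kmer_tokenize` and `mask_tokens` are all assumptions chosen for illustration.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol, not taken from the paper


def kmer_tokenize(seq, k=4):
    """Split a DNA sequence into overlapping k-mers with stride 1.

    Assumption: k-mer tokenization with k=4; the abstract does not
    state the tokenization scheme.
    """
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def mask_tokens(tokens, mask_rate=0.15, seed=None):
    """Randomly hide tokens and keep the originals as prediction targets.

    Returns (masked, labels). labels[i] holds the original token where
    position i was masked and None elsewhere, so a model would be
    trained to recover only the hidden positions.
    Assumption: a 15% independent masking rate, as in generic masked
    language modeling.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


# Toy sequence standing in for one ATAC-seq peak sequence.
tokens = kmer_tokenize("ACGTTGCAAC", k=4)
masked, labels = mask_tokens(tokens, mask_rate=0.15, seed=0)
```

One caveat with this simplified scheme: adjacent overlapping k-mers share k-1 bases, so a single masked token can largely be reconstructed from its neighbors. Practical DNA language models often mask contiguous runs of tokens for that reason. Whether TRAFICA does so is not stated in the abstract.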