DreaM: A Computational Pipeline for Enhanced Short-Read Sequence Analysis in Repetitive Genomic Regions

bioRxiv – November 12, 2024

Source: medRxiv/bioRxiv/arXiv

Summary

Short-read sequencing often struggles with repetitive genomic regions such as centromeres, where PCR duplicates mapped multiple times can be mistaken for signal enrichment. A new computational pipeline, DreaM, removes PCR duplicates from the raw reads and then trims them, which improves peak detection for the centromere marker CENP-A in ChIP-Seq and CUT&RUN data. This makes short-read analysis of DNA-protein binding in these regions more reliable.

Abstract

Mapping short sequencing reads to repetitive genomic regions, such as centromeres, presents significant challenges, primarily due to PCR duplicates, which can be erroneously mapped multiple times within these regions. Conventional bioinformatics pipelines often overlook this issue, so the resulting read pileups can be misinterpreted as signal enrichment. To address this, we developed DreaM (Deduplication of Reads for Enhanced and Accurate Mapping), a computational pipeline that prioritises the preprocessing of raw sequencing data. DreaM first identifies and removes PCR duplicates, then trims reads to reduce noise from multiply mapped reads. When applied to ChIP-Seq and CUT&RUN datasets targeting CENP-A, a key marker of centromeres, DreaM improved peak detection within centromeres. Overall, DreaM provides a robust solution for enhancing the analysis of DNA-protein binding sites in repetitive genomic regions using short-read sequencing.
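
The order of operations described in the abstract (deduplicate the raw reads first, trim second, map afterwards) can be illustrated with a minimal Python sketch. This is not DreaM's implementation, which the abstract does not detail. It assumes single-end reads, treats reads with identical sequences as PCR duplicates, and uses 3' quality trimming as a stand-in for the trimming step. The function names (`parse_fastq`, `deduplicate`, `trim`) and the thresholds are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple

# A read is (name, sequence, quality string); illustrative type only.
Read = Tuple[str, str, str]


def parse_fastq(lines: Iterable[str]) -> Iterator[Read]:
    """Yield reads from an iterable of FASTQ lines (4 lines per record)."""
    it = iter(lines)
    for header in it:
        header = header.strip()
        if not header:
            continue  # skip blank lines between records
        seq = next(it).strip()
        next(it)  # '+' separator line, ignored
        qual = next(it).strip()
        yield header[1:], seq, qual  # drop the leading '@'


def deduplicate(reads: Iterable[Read]) -> List[Read]:
    """Keep the first read seen for each distinct sequence.

    Assumption: identical sequence means PCR duplicate. The criterion
    used by the actual pipeline may differ.
    """
    seen = set()
    unique = []
    for name, seq, qual in reads:
        if seq not in seen:
            seen.add(seq)
            unique.append((name, seq, qual))
    return unique


def trim(reads: Iterable[Read], min_quality: int = 20,
         min_length: int = 20, phred_offset: int = 33) -> List[Read]:
    """Remove low-quality bases from the 3' end and drop short reads.

    Assumption: quality trimming with Phred+33 encoding; the thresholds
    are placeholders, not values taken from the paper.
    """
    kept = []
    for name, seq, qual in reads:
        end = len(seq)
        while end > 0 and ord(qual[end - 1]) - phred_offset < min_quality:
            end -= 1
        if end >= min_length:
            kept.append((name, seq[:end], qual[:end]))
    return kept


def preprocess(lines: Iterable[str], **trim_kwargs) -> List[Read]:
    """Deduplicate raw reads first, then trim, before any mapping."""
    return trim(deduplicate(parse_fastq(lines)), **trim_kwargs)
```

Calling `preprocess(open("reads.fastq"))` on a hypothetical single-end FASTQ file would return the reads to pass on to the aligner. Deduplicating before trimming means duplicates are compared on their full raw sequence, so two distinct reads that happen to trim to the same string are not collapsed.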

Tags