HOmo sapiens COmprehensive MOdel COllection (HOCOMOCO) contains transcription factor (TF) binding motifs represented as classic Position Weight Matrices (PWMs, also known as Position-Specific Scoring Matrices, PSSMs).
The conversion of position count matrices (PCMs) to PWMs in HOCOMOCO follows the scheme of MACRO-APE; see pages 20–21 of the respective manual. A uniform background was used in this conversion, as well as for estimating the downloadable threshold-to-P-value tables.
HOCOMOCO motifs were constructed with ChIPMunk by systematic motif discovery from thousands of ChIP-Seq and HT-SELEX datasets. Please refer to the HOCOMOCO v12 paper for more details on the motif discovery procedure, and to the Codebook MEX paper for details on the data sources and the motif discovery pipeline of the v13 update.
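As an illustration of what such a conversion involves, here is a minimal Python sketch of a log-odds PCM-to-PWM transformation with a pseudocount. The formula and the default pseudocount (the natural logarithm of the per-position count total, with a floor of 2) are assumptions made for this example; the authoritative definition is the one in the MACRO-APE manual. The function name `pcm_to_pwm` and its parameters are hypothetical.

```python
import math

def pcm_to_pwm(pcm, background=(0.25, 0.25, 0.25, 0.25), pseudocount=None):
    """Convert a position count matrix into a log-odds weight matrix.

    pcm: list of rows, one per motif position, each with counts for A, C, G, T.
    background: nucleotide probabilities (uniform by default).
    pseudocount: if None, ln(max(row total, 2)) is used. This default is an
        assumption of this sketch, not a confirmed HOCOMOCO setting.
    """
    pwm = []
    for row in pcm:
        total = sum(row)
        a = math.log(max(total, 2)) if pseudocount is None else pseudocount
        pwm.append([
            # Log-odds of the smoothed frequency against the background.
            math.log((count + a * q) / ((total + a) * q))
            for count, q in zip(row, background)
        ])
    return pwm
```

With a uniform background, a position whose four counts are equal gets a score of zero for every nucleotide, which is a quick sanity check for any implementation.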
[Motif finding; Sequence scanning]
HOCOMOCO provides PWMs accompanied by precomputed score thresholds. The thresholds and P-values for HOCOMOCO v13 motifs are estimated against uniform background probabilities. To interactively visualize predicted transcription factor binding sites (TFBS) in a small set of sequences, we provide MoLoTool. For large-scale analysis, we suggest using command-line tools such as our SPRY-SARUS or MEME's FIMO.
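To make the role of the score threshold concrete, the following Python sketch scans a sequence on both strands and reports every window whose PWM score reaches a given threshold. It is a simplified illustration, not the algorithm of MoLoTool, SPRY-SARUS, or FIMO. The names `scan` and `score_window` are hypothetical, and the PWM is assumed to be a list of rows with scores in A, C, G, T order.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def score_window(pwm, word):
    """Sum the per-position PWM scores for a word of the motif length."""
    return sum(row[INDEX[base]] for row, base in zip(pwm, word))

def scan(sequence, pwm, threshold):
    """Return (start, strand, score) for each window scoring >= threshold.

    start is the 0-based left coordinate on the forward strand.
    """
    sequence = sequence.upper()
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in INDEX for base in window):
            continue  # skip windows containing N or other ambiguous symbols
        revcomp = window.translate(COMPLEMENT)[::-1]
        for strand, word in (("+", window), ("-", revcomp)):
            score = score_window(pwm, word)
            if score >= threshold:
                hits.append((start, strand, score))
    return hits
```

In practice the threshold would be taken from the precomputed threshold-to-P-value table of the motif, so that the chosen P-value controls the expected rate of hits in random sequence.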
[Motif benchmarking; Performance metrics]
To assemble the HOCOMOCO v13 motif collection, we used multiple benchmarking protocols that evaluate motif performance for TFBS recognition in genomic regions (in vivo data: ChIP-Seq) and in artificial oligonucleotides (in vitro data: HT-SELEX, GHT-SELEX, SMiLE-Seq, and PBM), and for predicting regulatory single-nucleotide variants and polymorphisms (rSNPs). Please refer to the HOCOMOCO v12 paper, the Codebook MEX paper, and the Codebook MEX website for more details on the benchmarking protocols and the resulting performance metrics.
Each model in the collection has a quality rating from A to D, where A marks the motifs with the highest confidence. A-quality motifs and subtypes were found in at least two types of assays (ChIP-Seq, HT-SELEX, GHT-SELEX, SMiLE-Seq, or PBM), B-quality motifs were found in at least two different experiments of the same type, and C-quality motifs passed expert curation but were found in a single experiment. In the core collection, D quality marks subtypes that include only motifs inherited from HOCOMOCO v11; in v13 there are only a few such cases. In the sub-collections, D quality denotes all motifs not tested in the respective benchmarks (ChIP-Seq for v13-invivo; none of HT-SELEX, GHT-SELEX, SMiLE-Seq, or PBM for v13-invitro; rSNP for v13-rsnp).
Since v11, the alternative binding motifs of a particular TF are ranked from 0 (the primary model) to 1, 2, ... (the alternative motifs). The rank-0 motifs are the most 'general' variants with the best performance across the available benchmark data (see the HOCOMOCO v12 paper for details).
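The core-collection rating rules described above can be summarized as a small decision function. This Python sketch only restates those rules and is not the curation pipeline itself; the function name `core_quality`, its input format, and the `inherited_only` flag are hypothetical. Whether two assay labels count as different types is left to the caller.

```python
def core_quality(experiments, inherited_only=False):
    """Return the A-D rating implied by the supporting experiments.

    experiments: iterable of (assay_type, experiment_id) pairs in which
        the motif subtype was found, e.g. ("ChIP-Seq", "exp1").
    inherited_only: True if the subtype contains only motifs inherited
        from HOCOMOCO v11.
    """
    if inherited_only:
        return "D"
    distinct = set(experiments)
    assay_types = {assay for assay, _ in distinct}
    if len(assay_types) >= 2:
        return "A"  # found in at least two types of assays
    if len(distinct) >= 2:
        return "B"  # at least two experiments of the same type
    if len(distinct) == 1:
        return "C"  # a single experiment, assumed to have passed curation
    raise ValueError("no supporting experiments given")
```

Expert curation, which the C rating depends on, is a manual step and is simply assumed to have succeeded here.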
HOCOMOCO v12 used two data types for motif discovery: ChIP-Seq and HT-SELEX. The latter came in two variants: traditional HT-SELEX and methyl-HT-SELEX with mCpGs. In HOCOMOCO v13, three additional data types were used: GHT-SELEX (an HT-SELEX analogue with the input library collected from random genomic sequences), SMiLE-Seq, and PBM. Additionally, in benchmarking, we used information on differential transcription factor binding to single-nucleotide variants, obtained in SNP-SELEX and identified from ChIP-Seq (allele-specific binding, see ADASTRA). The motif name encodes the experiment types that yielded the motifs assigned to the same subtype during expert curation. We use the following abbreviations of experiment types: P (ChIP-Seq), S (HT-SELEX), M (Methyl-HT-SELEX), G (Genomic HT-SELEX), I (SMiLE-Seq), and B (PBM). For motifs found in several types of experiments, the motif name can contain any combination of these six letters (PSMGIB).
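The abbreviation scheme can be decoded with a simple lookup, as in the following Python sketch. It handles only the experiment-type letters listed above; how that field is positioned within a full motif name is not covered here. The function name `decode_experiment_types` is hypothetical.

```python
EXPERIMENT_TYPES = {
    "P": "ChIP-Seq",
    "S": "HT-SELEX",
    "M": "Methyl-HT-SELEX",
    "G": "Genomic HT-SELEX",
    "I": "SMiLE-Seq",
    "B": "PBM",
}

def decode_experiment_types(code):
    """Expand a string of experiment-type letters, e.g. "PS", into names."""
    unknown = [letter for letter in code if letter not in EXPERIMENT_TYPES]
    if unknown:
        raise ValueError(f"unknown experiment type letters: {unknown}")
    return [EXPERIMENT_TYPES[letter] for letter in code]
```

For example, `decode_experiment_types("PS")` yields the names for ChIP-Seq and HT-SELEX, in the order the letters appear.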