Antimicrobial peptides (AMPs) are small peptides (operationally defined here as proteins with 10-100 amino acid residues). They can disturb microbial growth and are produced by a wide range of organisms from all domains of life. AMPs are ancient defense molecules, and some have recently shown activity against multi-resistant pathogens. Besides their potential for clinical applications, AMPs are also widely used in food preservation and agriculture.
Because of their small size, AMPs cannot be detected with high confidence by standard gene prediction methods. As a result, they are often discarded (Santos-Júnior et al., PeerJ 2020).
Recently, we developed Macrel, which, unlike other methods, was designed for metagenomes. Using Macrel, we started the Global Survey for AMPs, a project that aims to collect all AMP sequences present in publicly available databases to date. Motivated to build a database-assisted platform that provides comprehensive functional and physicochemical features of large-scale (meta)genome-derived AMPs, we created AMPSphere!
Just as the ecosphere is the worldwide sum of all ecosystems, AMPSphere comprises the complexity of prokaryotic AMPs assembled in one dataset. To date, we have analyzed the 86k high-quality genomes in proGenomes2 and over 63k metagenomes from ENA, NCBI and JGI. After redundancy removal, we produced a collection of AMPs from the global microbiome containing 863,498 distinct sequences, clustered into 10,715 AMP families with at least 7 sequences each.
AMPSphere is deposited in Zenodo under DOI: 10.5281/zenodo.4574468.
AMPSphere is available as a web resource that displays each AMP for browsing by family, location, and samples where it was found.
The individual AMP cards list properties such as pI, charge, molecular weight, hydrophobicity, and the proportion of charged residues, together with the probabilities associated with the predictions of antimicrobial and hemolytic activity.
Collected AMPs can be mapped back to species, NCBI taxID accession codes, and also the accession of (meta)genomes. Besides that, AMP families also have pre-calculated phylogenetic trees, hidden Markov models (HMM), and alignments available.
AMPSphere offers two tools for homology-based sequence searching, both preloaded with our database: direct sequence alignment and searches against the HMM profiles of AMP families.
Questions can be sent to the mailing list dedicated to the project. To enable local analyses, the complete database is available for download.
Advanced users can download all our data or query it using the API.
This project is conducted in collaboration with the de la Fuente Lab at UPenn, the Bork Group at EMBL and the Huerta-Cepas group at CBGP.
We would like to acknowledge in particular: M. D. Torres, T.S.B. Schmidt, A.N. Fullam, P. Bork, X. Zhao, and J. Huerta-Cepas.
Several tests were applied to the small open reading frames (smORFs) to assess their quality. While none is definitive, they do select for higher-quality smORFs. Details can be found in Santos-Júnior et al. (2023), but we summarize them here:
AntiFam is a tool to detect spurious ORFs by matching them against a database of (1) erroneous Pfam families built in the past from erroneous gene predictions; and (2) translations of commonly occurring non-coding RNAs such as tRNAs.
AntiFam is routinely used as a quality control step for protein sequence databases of diverse origins, such as genomes and metagenomes. Details of AntiFam are available in Eberhardt et al. (2012). SmORFs from GMSC and AMPSphere were searched against AntiFam v7.0 using hmmsearch (command: hmmsearch --cut_ga AntiFam.hmm AMPSphere.fasta).
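As an illustration of how such a search could be post-processed, the sketch below collects the sequences with at least one AntiFam hit so they can be flagged as likely spurious. It assumes hmmsearch was additionally run with its --tblout option to write a tabular report; the function names and the sequence and model identifiers are hypothetical and are not taken from the AMPSphere pipeline.

```python
def antifam_hits(tblout_lines):
    """Return the names of sequences with at least one hit in an
    hmmsearch --tblout report (assumed input format).

    In that report, comment lines start with '#', and the first
    whitespace-separated column of every other line is the target
    (sequence) name.
    """
    hits = set()
    for line in tblout_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        hits.add(line.split()[0])
    return hits


def drop_spurious(smorf_ids, hits):
    """Keep only the smORF identifiers that had no AntiFam hit."""
    return [smorf for smorf in smorf_ids if smorf not in hits]


# Hypothetical report content, for illustration only.
example_report = [
    "# target name  accession  query name  accession  E-value  score  bias",
    "smorf_A  -  spurious_model_example  -  1e-10  40.0  0.1",
]
flagged = antifam_hits(example_report)
kept = drop_spurious(["smorf_A", "smorf_B"], flagged)
```

Because --cut_ga already applies the gathering thresholds curated for each model, every line in the report is treated as a hit here without further score filtering.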
Predicted smORFs used in AMPSphere all contain both START and STOP codons. Nonetheless, as true START codons may be difficult to distinguish from other ATG codons, we cannot guarantee that all smORFs are complete. However, if there is an in-frame STOP codon upstream of the START codon, the smORF can be considered a complete ORF.
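The upstream check can be sketched as follows. This is a minimal illustration, not the project's implementation: it assumes the smORF lies on the forward strand of the contig, that the three standard STOP codons apply, and that the 0-based position of the START codon is known. The function name is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard stop codons


def has_upstream_inframe_stop(contig, start):
    """Return True if a STOP codon occurs upstream of `start`,
    in the same reading frame as the START codon.

    `contig` is a DNA string and `start` the 0-based index of the
    first base of the START codon (forward strand assumed).
    """
    pos = start - 3
    while pos >= 0:
        if contig[pos:pos + 3].upper() in STOP_CODONS:
            return True
        pos -= 3  # step one codon further upstream, staying in frame
    # Reached the contig edge without meeting a STOP codon: the ORF
    # could extend beyond the assembled sequence.
    return False


complete = has_upstream_inframe_stop("TAAGGGATGAAATAA", 6)
uncertain = has_upstream_inframe_stop("GGGATGAAATAA", 3)
```

For a smORF on the reverse strand, the same function could be applied to the reverse complement of the contig with the coordinate converted accordingly.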
RNAcode predicts protein-coding regions based on evolutionary signatures typical of protein-coding genes. The signals it evaluates include synonymous/conservative nucleotide mutations, conservation of the reading frame, and the absence of internal stop codons. For more details, see Washietl et al. (2011).
The RNAcode analysis depends on a set of homologous, non-identical genes. Therefore, it can only be applied to families of smORFs. In particular, we applied RNAcode to smORFs encoded by at least 8 gene variants.
This test checks whether we can detect the smORF in a set of 221 publicly available metatranscriptomes, comprising human gut (142), peat (48), plant (13), and symbiont (17) [full list].
Using bwa, the short reads from the metatranscriptomes were mapped against the smORF sequences. A smORF was considered to be present in metatranscriptomics data when at least 1 read was mapped in a minimum of two samples (NGLess was used to count the number of reads mapped to each smORF).
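The presence criterion can be expressed as a small filter over per-sample read counts. The sketch below assumes the counts have already been collected into a nested dictionary (smORF, then sample, then number of mapped reads) and reads the rule as "at least one mapped read in each of at least two samples"; the data layout, function name, and identifiers are illustrative assumptions.

```python
def detected_smorfs(counts, min_reads=1, min_samples=2):
    """Return the smORFs supported by at least `min_reads` mapped
    reads in at least `min_samples` different samples.

    `counts` maps smORF id -> {sample id -> mapped read count}
    (assumed layout, e.g. assembled from per-sample count tables).
    """
    detected = set()
    for smorf, per_sample in counts.items():
        supporting = sum(1 for n in per_sample.values() if n >= min_reads)
        if supporting >= min_samples:
            detected.add(smorf)
    return detected


# Hypothetical counts, for illustration only.
example_counts = {
    "smorf_A": {"sample_1": 3, "sample_2": 1, "sample_3": 0},
    "smorf_B": {"sample_1": 12, "sample_2": 0, "sample_3": 0},
}
present = detected_smorfs(example_counts)
```

In this example only smorf_A passes: smorf_B has more reads in total, but all of them come from a single sample.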
Following a rationale similar to the one used to detect smORFs in metatranscriptomes, we searched for smORFs in metaproteome data available in PRIDE. A total of 109 publicly available metaproteomes from 37 environments were used [full list].
We used the test introduced by Ma et al. (2022), whereby a peptide is considered to be present in a metaproteome if a k-mer covering at least half of the peptide sequence is found in the metaproteome (exact matches only).
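A minimal sketch of this k-mer test is given below. It assumes the metaproteome is available as a list of identified peptide strings and takes k as half the peptide length, rounded up; if any longer exact match exists, a window of exactly k residues matches too, so only length-k windows need checking. The function name and the sequences are hypothetical, and this is an illustration of the idea rather than the code of Ma et al. (2022).

```python
import math


def peptide_in_metaproteome(peptide, metaproteome_peptides):
    """Return True if some k-mer of `peptide`, with k at least half
    its length, occurs exactly in any metaproteome peptide."""
    k = math.ceil(len(peptide) / 2)
    if k == 0:
        return False
    # All windows of length k along the candidate peptide.
    kmers = {peptide[i:i + k] for i in range(len(peptide) - k + 1)}
    return any(
        kmer in observed
        for observed in metaproteome_peptides
        for kmer in kmers
    )


# Hypothetical sequences, for illustration only.
found = peptide_in_metaproteome("MKKLLAAGGW", ["XXKLLAAXX"])
missing = peptide_in_metaproteome("MKKLLAAGGW", ["QQKLLAQQ"])
```

Here the first call succeeds because the 5-residue window KLLAA occurs in the observed peptide, while the second shares only 4 consecutive residues and fails.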