PANGAEA.
Data Publisher for Earth & Environmental Science

Meyer, Britta S; Hablützel, Pascal I; Roose, Anna K; Hofmann, Melinda J; Salzburger, Walter; Raeymaekers, Joost AM (2018): Alignment of genetic variants of the MHC class II loci of East African cichlids [dataset]. PANGAEA, https://doi.org/10.1594/PANGAEA.893876, In supplement to: Meyer, BS et al. (2019): An exploration of the links between parasites, trophic ecology, morphology, and immunogenetics in the Lake Tanganyika cichlid radiation. Hydrobiologia, 832(1), 215-233, https://doi.org/10.1007/s10750-018-3798-2

Abstract:
First, the raw reads (11,569 reads) were processed with Roche's demultiplexing and conversion tools (sffinfo, sfffile), and the sequences of the primer annealing sites were removed. For quality filtering, we removed reads that were too short (≤ 150 bp), allowed at most 1 % ambiguous bases (N), and filtered out low-quality sequences (mean quality ≥ 15 required). The remaining sequences were imported into Geneious (6.1.6, Biomatters Ltd, www.geneious.com) and de novo assembled (custom sensitivity: minimum overlap identity of 95 % and maximum ambiguity of 4) using all reads from one species. This resulted in contigs of single individuals with highly identical reads (pairwise identity: median 99.50 %) and contigs of several individuals sharing these reads (pairwise identity: median 99.40 %). The coverage ranged from 2 to 131 reads for single-individual contigs and from 2 to 337 reads for contigs originating from multiple individuals. We also kept low-coverage contigs (indicated with the suffix “low” in the alignment), as we use our data for measuring genetic diversity among tribes and not for investigating functionality or selection processes. However, if more than 3 bp of a read differed from the rest of the contig, the read was excluded; singletons that differed dramatically (≥ 10 mutations) from other contigs were also removed from the data set (reads N=517). Consensus sequences were generated within Geneious from each contig and for each individual using the 50 % strict rule. Most homopolymer regions were correctly called with these settings, and ambiguous positions were coded according to IUPAC rules. The obtained variants were aligned using MAFFT (--auto; 200PAM/k=2, 1.53 open penalty/0.123 offset) (Katoh & Standley, 2013), and insertions of ambiguous positions, homopolymers and misalignments were manually checked. This resulted in an alignment of 751 base pairs containing both intronic and exonic regions.
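The three read-level filters described above can be sketched as a single predicate. This is only an illustration, not the authors' pipeline: the function name `passes_filters`, its parameter names, and the reading of "Mean ≥ 15" as a minimum mean base quality are assumptions, and the ambiguity limit is taken as a per-read fraction of N.

```python
from statistics import mean


def passes_filters(seq, quals, max_short_len=150, max_n_fraction=0.01,
                   min_mean_quality=15):
    """Return True if a read survives the three filters (hypothetical helper).

    seq   -- read sequence as a string
    quals -- per-base quality scores, one number per base
    """
    # Reads of 150 bp or shorter are discarded as too short.
    if len(seq) <= max_short_len:
        return False
    # Assumption: at most 1 % of the bases in a read may be ambiguous (N).
    if seq.upper().count("N") / len(seq) > max_n_fraction:
        return False
    # Assumption: "Mean >= 15" is the minimum mean quality a read must reach.
    return mean(quals) >= min_mean_quality
```

A 200 bp read with uniform quality 30 and no ambiguous bases passes, whereas a 120 bp read, a read with 3 N in 200 bp, or a read with mean quality 10 is rejected.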
A BLAST search of the alleles led to the exclusion of further sequences (removed contigs N=266). In a next step, we shortened the alleles to exon 2 only in order to (i) reduce our data set to coding nucleotides and (ii) reduce the amount of missing data and ambiguities. This resulted in a total of 844 MHC exon 2 variants, each 160 bp long. Despite our methodological limitations (short reads and relatively low sequence coverage for some contigs), our results provide a valid measurement of immunogenetic diversity that is comparable across all tribes of Lake Tanganyika cichlids, as these biases are expected to be similarly distributed across the different tribes.
The available alignment was generated as described above and saved as a FASTA file. It includes 844 MHC exon 2 variants (160 base pairs each). The header of each entry gives the scientific species name (e.g. Tropheus moorii) followed by a specific variant name (e.g. 04_01). This can be followed by the indication "low", which was added if the variant was called with very low coverage (≤ 2x).
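For readers who want to load the alignment, the following is a minimal sketch of reading FASTA records and splitting each header into species, variant name and the "low" flag. The exact header layout is not specified above, so the pattern assumed here (genus, species epithet, variant such as 04_01, optional "low", separated by underscores or spaces, e.g. `Tropheus_moorii_04_01_low`) is hypothetical, as are the names `parse_header`, `read_fasta` and `Variant`; headers that do not fit the pattern are returned as `None` rather than guessed at.

```python
import re
from typing import Iterable, Iterator, NamedTuple, Optional, Tuple

# Assumed header layout: <Genus>_<species>_<NN_NN>[_low]
HEADER_RE = re.compile(
    r"^(?P<species>[A-Za-z]+[_ ][A-Za-z.\-]+)[_ ]"
    r"(?P<variant>\d+_\d+)"
    r"(?:[_ ](?P<low>low))?$"
)


class Variant(NamedTuple):
    species: str
    variant: str
    low_coverage: bool


def parse_header(header: str) -> Optional[Variant]:
    """Split a header into its parts; None if the assumed layout does not fit."""
    m = HEADER_RE.match(header.lstrip(">").strip())
    if m is None:
        return None
    species = m["species"].replace("_", " ")
    return Variant(species, m["variant"], m["low"] is not None)


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from the lines of a FASTA file."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)
```

Passing an open file handle to `read_fasta` and each header to `parse_header` would then allow, for example, counting variants per species or excluding the entries flagged as low coverage.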
Size:
9.1 kBytes
