DNA

Abstract

Three ancient mummified remains, linked to the Nazca mummies, have prompted speculation about their origins, particularly in communities interested in the possibility of extraterrestrial life. We analyzed the DNA of these three remains to see what genetic information might be there.

All three samples contained aged, degraded DNA, typical of ancient remains, and were heavily contaminated with microbial DNA, mainly bacterial, which is common for environmentally exposed samples. Human DNA was present in all three samples, with one aligning very strongly to the human genome, but in a way that raises more questions than it answers. We assembled the reads that did not align to the human genome and found that most of the classifiable contigs matched known bacteria. For the sample with substantial human DNA, an additional analysis of mitochondrial DNA (mtDNA), which is maternally inherited, placed it in the proposed human mtDNA haplogroup "M20a". Rather than a lineage from the pre-European-contact Americas, this is a specific Southeast Asian maternal lineage, suggesting origins other than a millennium-old Peruvian cave and opening up a range of further questions.

Across different alignments, assemblies, and analyses, our findings suggest ancient DNA coupled with contamination. The outputs (SPAdes and MEGAHIT assemblies, hg38 alignments, VCFs, and reports) will be made available once we find hosting for them. Nearly half the reads remained unclassified across all samples.

Further analysis could yield more insights. This might, for instance, include BLASTn/p analysis of contigs, refining MEME motifs to exclude repeats, or running anomaly detection on the alignments, unmapped reads, and de novo assemblies—all of which we're sharing for further analysis.

While there’s no evidence of extraterrestrial species, this project provides a clear roadmap and methodology for scientific inquiry into the unknown. We’ve shared our methods and findings on GitHub, and are seeking a secure digital haven for our cleaned-up data. This venture underscores science as an open door to probing some of the most interesting questions of our times, inviting anyone who is interested to join this genomic journey into the Nazca mummies.

Introduction

Background and context

Many early South American groups practiced mummification. In many arid and high-altitude climates, bodies would naturally desiccate, which several groups used to their advantage in preserving and honoring their dead. Several examples of these bodies have been discovered and examined by scientists since the 19th century (MUMMYSTUDIES).

Between 2016 and 2017, multiple tridactyl (i.e., three-fingered) Nazca mummies were discovered in Peru. One group of five tridactyl mummies was found in a tomb in Nazca, Peru in January 2016 (link, link), while another (named Maria) was found in a catacomb on the outskirts of Nazca, Peru in June 2017. Various reports suggest the mummies date back as far as 6,500 years, with some being found in a Peruvian tomb and examined by Russian scientists who claimed that their anatomical structure differs from that of humans (link, link).

The mummies were taken into custody by the Universidad Nacional San Luis Gonzaga in Peru, which aimed to exhibit them and promote their research. An event on November 6, 2019, at the university marked the official recovery of some of these mummies and launched an international call for researchers for the scientific study of these relics.

On September 13, 2023, UFO enthusiast and journalist Jaime Maussan presented two of these mummies to the Mexican Congress, claiming they were about 1,000 years old based on a carbon dating process conducted by Mexico's National Autonomous University (UNAM). However, the scientists from UNAM only determined the age of the samples provided and did not draw conclusions about their origins (link, link, link, link).

The tridactyl Nazca mummies unearthed in Peru have captivated the imagination of both the public and certain factions within the scientific community. With their three-fingered hands and claims of extraterrestrial origin, these relics evoke the thrilling possibilities of the unexplored and the unknown, while also offering opportunities for serious and destigmatized scientific exploration.

The mummies’ DNA data are freely accessible (Ancient0002, Ancient0003, Ancient0004), yet the expertise to interpret them is not as widespread. Thus, we identified an opportunity to apply robust scientific methodologies to the DNA of these remains and, in doing so, demystify the narrative surrounding them.

Bioinformatics as a key methodological approach

Bioinformatics marries biology with computer science, unlocking the ability to dissect the genomic essence of these ancient beings. Because of its ability to process large amounts of biological data, bioinformatics can analyze dense and diverse samples as well as identify unexpected connections.

For example, as we discuss later, the discovery of mtDNA haplogroup "M20a" in one of the samples was a notable find, linking it to an Asian lineage rather than the expected pre-European-contact Americas lineage. Analysis on this scale would not have been possible without the tools that bioinformatics offers.

Research question

Our quest was simple: Analyze the Whole Genome Sequencing (WGS) samples of these mummies, unearthing any genomic insights they may harbor. This reflects a broader twofold goal: to collect credible data about the genomic information contained in the samples, and to democratize scientific inquiry. The latter is a crucial project in an era where information is abundant yet misinformation is rampant, and we aimed to empower other people with the knowledge and tools to delve into such intriguing questions themselves.

Methods

Data preparation

We began with the three samples of ancient DNA from the NCBI Sequence Read Archive (SRA), uploaded by the Instituto Politécnico Nacional of Mexico in 2022. Two of the samples, Ancient0002 and Ancient0004, came from a mummy nicknamed "Victoria," while the third, Ancient0003, came from a yet-unnamed mummy.

To prepare clean data for analysis, we followed a series of steps to improve data quality.

Using the standard SRA Toolkit, we downloaded the data from SRA with prefetch and converted it to FASTQ files with fasterq-dump. We then removed duplicate reads with BBMap and trimmed sequencing adapters with Trimmomatic, leaving clean sequences ready for the analytical stage.

Then, to evaluate and control quality, we ran FastQC on the processed reads. Our deduplication was moderate, so even after processing, Ancient0002 and Ancient0004 still had relatively high duplication levels (31.62% and 39.79% respectively), which suggests degradation or contamination leading to artifacts in the amplification process. Ancient0003, on the other hand, had a lower duplication level of 15.29%. Apart from a small rise in GC content for Ancient0003, which was still consistent with human DNA, all other quality metrics stayed within normal ranges, indicating that the data were consistent with ancient DNA and that we could proceed.
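To make the duplication percentages more concrete, here is a minimal Python sketch of what a duplication level means for a set of reads. It is an illustration only: it treats a read as a duplicate when an identical sequence was already seen, whereas FastQC uses its own estimate and BBMap has its own matching rules, so it will not reproduce the figures above. The function name and the toy reads are hypothetical.

```python
from collections import Counter


def duplication_level(reads):
    """Return (unique_reads, percent_duplicates) for an iterable of read sequences.

    A read counts as a duplicate when an identical sequence was already seen.
    """
    counts = Counter(reads)
    total = sum(counts.values())
    if total == 0:
        return [], 0.0
    unique = list(counts)  # Counter keeps first-seen order
    percent_duplicates = 100.0 * (total - len(unique)) / total
    return unique, percent_duplicates


# Hypothetical toy reads, not taken from the samples.
toy_reads = ["ACGT", "ACGT", "TTGA", "ACGT", "GGCC"]
unique_reads, percent = duplication_level(toy_reads)
print(f"{len(unique_reads)} unique reads, {percent:.1f}% duplicates")
```

In the toy input, two of the five reads repeat an earlier sequence, so the duplication level is 40%.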

Taxonomic classification

We chose Kraken2 (KRAKEN2) for taxonomic classification. It’s fast and accurate, producing useful and easy-to-analyze data. In addition, Kraken2 is popular, well-understood, and actively maintained in the scientific community.

We chose the comprehensive “nt” database to run Kraken2 against. Using this well-established reference required a robust computer setup: provisioning a computer with over half a terabyte of RAM to hold the database and ensure the process ran smoothly and swiftly. The investment was well worth it as it maximized our chances of catching meaningful matches from our samples.

We visualized the results from Kraken2 using Krona (KRONA), creating an interactive chart that laid out the taxonomic classification. This enabled not only our analysis but future review by others.

This step showed that we were looking at metagenomic data. It gave us a broad sense of the landscape and helped tailor our choice of tools for the deeper analysis that followed.
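For readers who want to pull the headline numbers out of a Kraken2 report themselves, the following Python sketch summarizes the unclassified fraction and the domain-level percentages. It assumes the standard six-column, tab-separated Kraken2 report layout (percentage, clade read count, direct read count, rank code, taxonomy ID, indented name). The function name is hypothetical and the toy report uses made-up numbers, not our results.

```python
def summarize_kraken_report(text):
    """Map 'unclassified' and each domain-level name to its percentage of reads."""
    summary = {}
    for line in text.strip().splitlines():
        fields = line.split("\t")
        if len(fields) < 6:
            continue  # skip malformed or empty lines
        percent = float(fields[0])
        rank_code = fields[3]
        name = fields[5].strip()  # names are indented by taxonomic depth
        if rank_code in ("U", "D"):  # U = unclassified, D = domain
            summary[name] = percent
    return summary


# Hypothetical toy report with invented numbers.
toy_rows = [
    ("50.00", "500", "500", "U", "0", "unclassified"),
    ("50.00", "500", "0", "R", "1", "root"),
    ("40.00", "400", "10", "D", "2", "  Bacteria"),
    ("9.00", "90", "5", "D", "2759", "  Eukaryota"),
    ("1.00", "10", "1", "D", "2157", "  Archaea"),
]
toy_report = "\n".join("\t".join(row) for row in toy_rows)

summary = summarize_kraken_report(toy_report)
for name, percent in summary.items():
    print(f"{name}: {percent:.2f}%")
```

A summary like this is a quick sanity check before opening the interactive Krona chart.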

Alignment and assembly

The next goal was to analyze the DNA sequences from our ancient samples in two different ways: alignment and assembly.

Alignment

Alignment refers to matching the DNA sequences from our samples to a known reference, in this case, the human genome (hg38). We used Bowtie2 (BOWTIE2) for this task, a tool known for its speed and efficiency, making it a good fit for handling our data.

Since NCBI’s classification hinted at a significant presence of human DNA in our samples, we chose to align to the human genome. This would help us explore the human DNA in the samples and evaluate claims of genetic engineering associated with them.

Assembly

Assembly refers to piecing together the DNA sequences to form longer, more complete sequences, without using a reference. We used two tools for this: SPAdes (SPADES) and MEGAHIT (MEGAHIT). Both are well-regarded in the community for their accuracy and speed. Running both allowed us to cross-verify the results, ensuring a more reliable analysis. MEGAHIT was set to handle large metagenomic data, and SPAdes was run as part of a pipeline called shovill that also further cleaned the reads. This way, we could get a clearer picture of the DNA, especially the parts that didn't align to the human genome.

We forked the original shovill repository to add functionality that let us run only specific parts of the pipeline. Without this change, we ran into resource issues.

We used these tools with standard settings, making only minor adjustments: Bowtie2 was run with --local for a more flexible alignment, and MEGAHIT with --presets meta-large to specify a large, high-diversity metagenome. This approach kept the process straightforward and set a solid foundation for the deeper analysis that followed.
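The two non-default settings can be sketched as command lines. The Python snippet below only builds and prints the commands; it does not run them. It assumes paired-end FASTQ input, and the file names, index prefix, thread count, and helper function names are all hypothetical placeholders rather than the exact invocations we used.

```python
import shlex


def bowtie2_command(index_prefix, reads_1, reads_2, out_sam, threads=8):
    """Build a Bowtie2 command using local alignment (--local)."""
    return [
        "bowtie2", "--local",
        "-p", str(threads),
        "-x", index_prefix,        # prefix of a prebuilt Bowtie2 index (assumed)
        "-1", reads_1, "-2", reads_2,
        "-S", out_sam,
    ]


def megahit_command(reads_1, reads_2, out_dir, threads=8):
    """Build a MEGAHIT command using the meta-large preset."""
    return [
        "megahit", "--presets", "meta-large",
        "-t", str(threads),
        "-1", reads_1, "-2", reads_2,
        "-o", out_dir,
    ]


# Hypothetical file names for illustration only.
align_cmd = bowtie2_command("hg38_index", "sample_1.fastq", "sample_2.fastq", "sample.sam")
assemble_cmd = megahit_command("unaligned_1.fastq", "unaligned_2.fastq", "megahit_out")
print(shlex.join(align_cmd))
print(shlex.join(assemble_cmd))
```

Building the commands as lists makes them easy to pass to subprocess.run once the paths point at real files.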

Binning

Binning is a way to group together DNA sequences that likely come from the same organism. This step helped us organize our data into more manageable chunks and enabled possible insights into what organisms might be present in our samples.

The workflow was as follows. We started with the assembled contigs, the longer stretches of DNA that we pieced together in the previous assembly step, and attempted to separate them into bins, each representing a potentially different organism. We used a script to automate the binning process, which did three things:

1. Subsetting contigs: Before diving into binning, we took a random subset of our contigs using a tool called Seqtk. This step made the process a bit more manageable.

2. Binning with MetaBAT2: We then used a tool called MetaBAT2 (METABAT2) to perform the actual binning. This tool grouped our contigs into bins based on certain characteristics they share.

3. Quality checking with CheckM: Lastly, we ran a tool called CheckM (CHECKM) on our bins to assess their quality. This step ensures that the bins we've created are reliable for further analysis.

We ran this script twice, once for each set of contigs we obtained from the two different assembly tools, SPAdes and MEGAHIT. This allowed us to compare and replicate results.

The modified shovill script was flexible, allowing us to skip any of the steps if needed, thanks to the flags --no-subset, --no-maxbin, and --no-checkm. This enabled us to tailor the workflow to our needs.
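To give a feel for the kind of shared characteristics a binner can use, the Python sketch below computes tetranucleotide (4-mer) frequencies for contigs and compares them by Euclidean distance. Sequence composition of this kind is one of the signals MetaBAT2 draws on, alongside coverage, but this is a simplified illustration and not MetaBAT2's algorithm. For example, it does not merge reverse complements. The function names and toy contigs are hypothetical.

```python
from collections import Counter
from itertools import product

# All 256 possible 4-mers over A, C, G, T, in a fixed order.
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetranucleotide_frequencies(seq):
    """Return the 256 tetranucleotide frequencies of a sequence (summing to 1)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = sum(counts[k] for k in KMERS)  # ignores 4-mers containing N, etc.
    if total == 0:
        return [0.0] * len(KMERS)
    return [counts[k] / total for k in KMERS]


def euclidean(a, b):
    """Euclidean distance between two equal-length frequency vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


# Hypothetical toy contigs: two AT-rich, one GC-rich.
at_rich_a = "ATATATTTAAATATTA" * 4
at_rich_b = "TATATTTAAATATTAA" * 4
gc_rich = "GCGCGGCCGCGGGCCG" * 4

freq_a = tetranucleotide_frequencies(at_rich_a)
freq_b = tetranucleotide_frequencies(at_rich_b)
freq_gc = tetranucleotide_frequencies(gc_rich)

print("AT-rich vs AT-rich:", round(euclidean(freq_a, freq_b), 4))
print("AT-rich vs GC-rich:", round(euclidean(freq_a, freq_gc), 4))
```

Contigs with similar composition end up close together, which is the intuition behind grouping them into the same bin.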

Haplotyping

A haplotype is a group of DNA variants along a single chromosome that are likely to be inherited together. Haplotyping (i.e., looking for these groups) can tell us about the genetic differences within a population, as well as the potential lineage of a sample.

Here, we examined the Ancient0003 sample, the most likely candidate to yield insights given its strong alignment to the human genome, with potential richness of mitochondrial DNA (mtDNA). We hypothesized that if this sample is indeed ancient and was found in a cave in Peru, its mtDNA should reveal a lineage predating European colonization in the Americas.

Workflow

We initiated the process with variant calling to identify variations in DNA sequences. This was done using tools like samtools and bcftools. We needed to ensure the quality of these variants, so we calculated various statistics like mean and standard deviation of quality scores.

The next step was to filter these variants for quality, focusing on those with a quality score above the median score of 69, to narrow down to more reliable data. Following this, we filtered down further to only include SNPs (single nucleotide polymorphisms), the most common type of genetic variation among people. We then moved on to phasing with whatshap. Phasing helped us determine which variants are associated with each other, a crucial step for understanding the haplotypes.
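The quality statistics and the two filtering steps can be sketched in plain Python. This is a stand-in for illustration, not the bcftools commands we ran: it reads only the first six VCF columns, uses a hypothetical set of toy records with invented quality scores, and keeps SNPs whose QUAL is strictly above the median.

```python
from statistics import mean, median, pstdev


def parse_vcf_records(lines):
    """Parse CHROM, POS, REF, ALT and QUAL from VCF data lines, skipping headers."""
    records = []
    for line in lines:
        if not line or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alt, qual = line.split("\t")[:6]
        records.append({"chrom": chrom, "pos": int(pos), "ref": ref,
                        "alt": alt, "qual": float(qual)})
    return records


def is_snp(record):
    """True when REF and every ALT allele are single nucleotides."""
    alts = record["alt"].split(",")
    return len(record["ref"]) == 1 and all(len(a) == 1 and a in "ACGT" for a in alts)


def filter_high_quality_snps(records):
    """Keep SNPs with QUAL strictly above the median QUAL of all records."""
    threshold = median(r["qual"] for r in records)
    kept = [r for r in records if r["qual"] > threshold and is_snp(r)]
    return kept, threshold


# Hypothetical toy records (tab-separated), not taken from the samples.
toy_vcf = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL",
    "chr1\t100\t.\tA\tG\t30",
    "chr1\t200\t.\tC\tT\t50",
    "chr2\t250\t.\tG\tGA\t60",
    "chr2\t400\t.\tG\tA\t69",
    "chr3\t300\t.\tAT\tA\t75",
    "chr3\t500\t.\tT\tC\t90",
    "chrM\t150\t.\tT\tC\t95",
]

records = parse_vcf_records(toy_vcf)
quals = [r["qual"] for r in records]
print(f"mean QUAL {mean(quals):.1f}, standard deviation {pstdev(quals):.1f}")

kept, threshold = filter_high_quality_snps(records)
print(f"median threshold {threshold}, kept {[(r['chrom'], r['pos']) for r in kept]}")
```

In the toy data, the indel at chr3:300 passes the quality threshold but is dropped by the SNP filter, which mirrors the order of the two steps described above.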

Finally, we extracted the mitochondrial variants using bcftools and ran haplogrep to classify the mtDNA haplogroup against the MitoImpute panel. This would help us understand the lineage of our ancient sample.

Results

Overall, our analysis showed samples consistent with degraded ancient DNA. The composition of each sample was consistent with the metagenomic profile one would expect from tissue extracted from an ancient mummy. The classified DNA reads were largely identified as bacterial, with some presence of human-aligned DNA; in particular, Ancient0003 showed a very strong alignment with the human genome, leading us to investigate the ancestral roots of its mitochondrial DNA.

DNA composition and classification

Our quality control analysis showed high duplication levels, consistent with the DNA degradation found in ancient DNA samples. Our classification suggested a rich and diverse community of microbial life in these ancient remains. Between 49% and 52% of the raw reads in each sample were unclassified, which is not unusual in ancient DNA samples. Our Kraken2 analysis, visualized in Krona and available on our site, showed that the classified reads were dominated by bacterial DNA (Ancient0002, Ancient0003, Ancient0004). This is consistent with what might be expected from ancient DNA samples that have not been cleaned or processed, and does not indicate anything unusual.

Mitochondrial DNA Analysis

Ancient0003, the sample with >95% alignment to the human genome, yielded a particularly interesting find. Using a MitoImpute reference panel constructed from 36,960 mitochondrial genomes, we identified a basic membership in haplogroup "M," a very broad maternal haplogroup commonly found across East and Southeast Asia. In that panel, haplogroup “M” is not mapped at a high resolution, so we investigated further and found a panel that identified a proposed haplogroup, “M20a”, identified in a multiethnic population of 327 citizens of Myanmar located in Northern Thailand. All of Ancient0003’s mutations were identified in samples MMR137 and MMR317 (raw data archived here), strongly matching proposed haplogroup M20a (Summerer et al., 2014). This was a compelling deviation from the expected result, as it hinted at a genetic thread that spanned across continents from South America to a very specific population in southeast Asia. There were no identified subclades that could link the mtDNA to the Americas, adding a layer of intrigue to the human lineage narrative.
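The comparison against individual reference samples amounts to a subset check: every variant in the sample should also appear in the reference. The Python sketch below shows that check. The variant labels and reference names are hypothetical placeholders, not the actual Ancient0003 mutations or the samples from the Myanmar panel.

```python
def missing_variants(sample_variants, references):
    """For each reference, list the sample variants that the reference lacks."""
    sample = set(sample_variants)
    return {
        name: sorted(sample - set(variants))
        for name, variants in references.items()
    }


# Hypothetical placeholder labels, not real mtDNA mutations.
sample_variants = ["v1", "v2", "v3"]
references = {
    "ref_A": ["v1", "v2", "v3", "v4"],  # carries every sample variant
    "ref_B": ["v1", "v3", "v5"],        # lacks v2
}

report = missing_variants(sample_variants, references)
full_matches = [name for name, missing in report.items() if not missing]
print("missing per reference:", report)
print("references carrying all sample variants:", full_matches)
```

A reference with an empty "missing" list carries all of the sample's variants, though it may also carry additional variants of its own.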

Assembly and alignment

The samples all showed some alignment with the human genome, with the two samples from “Victoria” showing less (10.88% of the deduplicated Ancient0002 reads aligned to the human genome, and 12.02% of the Ancient0004 reads aligned), while Ancient0003, from an unspecified mummy, showed 95.69% alignment to the human genome.
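An alignment percentage of this kind can be recomputed from a SAM file by counting reads whose "unmapped" flag bit (0x4) is not set. The Python sketch below does this for primary records only. It is a simplified illustration under that assumption: Bowtie2's own summary for paired-end data is computed somewhat differently, and the toy records are hypothetical.

```python
def alignment_rate(sam_lines):
    """Percentage of primary SAM records that are mapped (flag bit 0x4 unset)."""
    total = 0
    aligned = 0
    for line in sam_lines:
        if not line or line.startswith("@"):
            continue  # skip header lines
        flag = int(line.split("\t")[1])
        if flag & 0x900:
            continue  # skip secondary (0x100) and supplementary (0x800) records
        total += 1
        if not flag & 0x4:
            aligned += 1
    return 100.0 * aligned / total if total else 0.0


def toy_record(name, flag, mapped=True):
    """Build a minimal 11-column SAM line for illustration."""
    rname, pos, cigar = ("chr1", "100", "4M") if mapped else ("*", "0", "*")
    return "\t".join([name, str(flag), rname, pos, "42", cigar, "*", "0", "0", "ACGT", "IIII"])


# Hypothetical toy alignment: two mapped reads, two unmapped, one secondary record.
toy_sam = [
    "@HD\tVN:1.6",
    toy_record("r1", 0),
    toy_record("r2", 4, mapped=False),
    toy_record("r3", 16),
    toy_record("r4", 4, mapped=False),
    toy_record("r1", 256),
]

rate = alignment_rate(toy_sam)
print(f"{rate:.2f}% of reads aligned")
```

Skipping secondary and supplementary records keeps each read from being counted more than once.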

With the remaining unaligned reads, we conducted de novo assemblies using both SPAdes and MEGAHIT. After binning both assemblies from each sample with MetaBAT2 (Ancient0002 megahit and spades; Ancient0003 megahit and spades; Ancient0004 megahit and spades), we found generally consistent lineages across the bins, though the percentages of each lineage differed between the SPAdes and MEGAHIT binnings. In general, the MEGAHIT assemblies showed lower contamination, while the SPAdes assemblies showed higher completeness.

These assemblies and alignments are now neatly packaged, ready to be delved into for anyone interested in extending this genomic narrative.

For all three samples, the contigs we acquired post-assembly were primarily classified (binned) as bacteria and archaea, aligning with the overarching microbial theme of our findings.

Discussion

Our initial investigation began with a simple premise: to sift through the DNA and see what the data suggest.

Along with widely varying alignment rates to the human genome (from roughly 11% to 96%), we discovered a wide range of ancient microbial life and an unexpected genetic linkage to Myanmar, specifically through the mitochondrial DNA haplogroup "M20a" in the sample Ancient0003. The lack of American subclades, together with the Asian lineage, raises new questions about the origins of the human DNA in that sample. Is the most likely explanation, a simple mishandling of the samples along the chain of custody, the correct one, or does the result enrich the story of human migration? How did this DNA thread from Myanmar weave its way into a Peruvian mummy? This deviation from the anticipated American lineage invites a wealth of questions regarding human migration and potential intercontinental connections in antiquity.

The unclassified reads, contigs, and assemblies suggest several avenues for further analysis—BLASTing samples of the contigs, re-running the MEME motifs to exclude repeats, delving into anomaly detection on the alignments and assemblies, and so on.

Importantly, our inability to find evidence of extraterrestrial origin does not discredit the terrestrial nature of these mummies. DNA chemistry as we know it evolved on Earth, tailored by Earth's unique conditions. There's no given reason to believe that extraterrestrial life, if it exists, would share the same molecular structures. For instance, if we were to analyze a soft robot, we wouldn’t find alien DNA. We might instead come across various microbial populations residing on or within it. Hence, the findings from this investigation hold significance only if the premise is that these mummies should possess recognizable DNA.

Conclusion and future questions

The lack of evidence supporting extraterrestrial origin in no way undermines the terrestrial nature of these mummies. It merely underscores the limitations of our current understanding and the specific molecular framework we operate within.

It is also worth mentioning that these DNA samples pre-date the mummies that have so captivated researchers and enthusiasts since the Mexican hearing; we do not have samples for “Josephina” or “Maria”; and we do not know the attributed origin of the Ancient0003 sample. Our pipelines are reproducible, and we are eager to conduct further analysis on additional sample runs from the more recently reported mummies.

The results showing a Myanmar-related lineage on Ancient0003 rather than an indigenous Peruvian lineage are perplexing, and stand out as subjects for future investigation. If we were to take further steps in this direction, we would compare mtDNA and Y-chromosome haplotypes across all three samples as well as against sets of other ancient DNA samples, especially those from the Americas.

The assemblies and alignments of all samples, along with the unclassified reads, contigs, and potential for further detailed analysis, present a rich ground for continued exploration. Anyone interested can take a closer look, run BLAST on the contigs, and perhaps unveil further insights that we might have missed.

The genomic data we've unearthed serves as a stepping stone, inviting curious minds to extend the narrative, to probe deeper into the genomic essence of these ancient remains. Our findings have broadened the discourse, offering a solid foundation for future investigations.

Supplemental materials

References

(MUMMYSTUDIES) Lombardi G, Arriaza B. South American Mummies. In: Shin DH, Bianucci R, editors. The Handbook of Mummy Studies: New Frontiers in Scientific and Cultural Perspectives. Singapore: Springer Singapore; 2020. p. 1–14.

(MMRPAPER) Summerer M, Horst J, Erhart G, Weißensteiner H, Schönherr S, Pacher D, Forer L, Horst D, Manhart A, Horst B, Sanguansermsri T, Kloss-Brandstätter A. Large-scale mitochondrial DNA analysis in Southeast Asia reveals evolutionary effects of cultural isolation in the multi-ethnic population of Myanmar. BMC Evol Biol. 2014 Jan 28;14:17. doi: 10.1186/1471-2148-14-17. PMID: 24467713; PMCID: PMC3913319.

(KRAKEN2) Wood, D.E., Lu, J. & Langmead, B. Improved metagenomic analysis with Kraken 2. Genome Biol 20, 257 (2019). https://doi.org/10.1186/s13059-019-1891-0

(KRONA) Ondov BD, Bergman NH, Phillippy AM. Interactive metagenomic visualization in a Web browser. BMC Bioinformatics. 2011 Sep 30;12:385. doi: 10.1186/1471-2105-12-385. PMID: 21961884; PMCID: PMC3190407.

(BOWTIE2) Langmead, B., Trapnell, C., Pop, M. et al. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10, R25 (2009). https://doi.org/10.1186/gb-2009-10-3-r25

(SPADES) Prjibelski, A., Antipov, D., Meleshko, D., Lapidus, A., & Korobeynikov, A. (2020). Using SPAdes de novo assembler. Current Protocols in Bioinformatics, 70, e102. doi: 10.1002/cpbi.102

(MEGAHIT) Li D, Liu CM, Luo R, Sadakane K, Lam TW. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics. 2015 May 15;31(10):1674-6. doi: 10.1093/bioinformatics/btv033. Epub 2015 Jan 20. PMID: 25609793.

(METABAT2) Kang DD, Li F, Kirton E, Thomas A, Egan R, An H, Wang Z. MetaBAT 2: an adaptive binning algorithm for robust and efficient genome reconstruction from metagenome assemblies. PeerJ. 2019 Jul 26;7:e7359. doi: 10.7717/peerj.7359. PMID: 31388474; PMCID: PMC6662567.

(CHECKM) Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 2015 Jul;25(7):1043-55. doi: 10.1101/gr.186072.114. Epub 2015 May 14. PMID: 25977477; PMCID: PMC4484387.