A strategy for aggregating multi-source historical phenotypic and genotypic data sets containing homonyms for global genomic prediction in apple (Malus domestica)
Genomic prediction can be used to combine historical phenotypic and genotypic data sets from multiple sources to match novel germplasm with new production environments. Implementation of genomic prediction in apple (Malus domestica Borkh.) benefits from accurate matching of the identities of genetic treatments (i.e., accessions) and SNP marker loci across large data sets from multiple experiments and trials. However, data collection and formatting methods differ among data sources. Thesauri can be used to harmonize data sets so that they can be aggregated. For apple, we developed scripts to produce thesauri that standardize accession names and SNP locus identifiers across the RosBREED, FruitBreedomics, and Australian Grove genomic data sets, which were generated from three SNP genotyping platforms. One challenge of aggregation is the presence of errors in the data that lead to homonyms (non-uniqueness in a name used to refer to a specific accession or its clone). To correctly label homonyms in these data sets, the thesauri were primed with historical data (training data sets) labeled with Malus UNiQue identifiers (MUNQ IDs) and SNP marker information. The resultant scripts revealed that these training data sets also contained homonyms caused by: 1) MUNQ IDs misassigned to accession names; 2) accession identifiers misassigned to MUNQ IDs or accession names; 3) SNP identifiers misassigned to 48 markers; and 4) potentially incorrect records in international databases leading to ostensible inferences about the accession name. To resolve these homonyms, the scripts were extended to identify potential errors in published historical data sets, correct resolvable data processing errors, and append the accession ID or source to the accession name where the source of the error behind the homonym could not be determined. Correcting these homonyms has facilitated the aggregation of approximately 2184 unique accessions across 259,850 SNP loci from the three genotyping platforms.
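The thesaurus and homonym-labeling idea described above can be illustrated with a minimal Python sketch. This is not the authors' scripts: the function names (`normalize`, `build_thesaurus`, `find_homonyms`, `label_records`), the normalization rule, the placeholder identifiers, and the example records are all assumptions made for illustration. It assumes a homonym can be flagged as a normalized accession name that the training data link to more than one identifier, and that unresolved homonyms are disambiguated by appending the data source to the name.

```python
from collections import defaultdict


def normalize(name):
    """Reduce a name to lowercase letters and digits so spelling variants match.

    Hypothetical rule; real curation may need cultivar-specific handling.
    """
    return "".join(ch for ch in name.lower() if ch.isalnum())


def build_thesaurus(training_records):
    """Map each normalized accession name to the set of IDs recorded with it."""
    thesaurus = defaultdict(set)
    for name, unique_id in training_records:
        thesaurus[normalize(name)].add(unique_id)
    return thesaurus


def find_homonyms(thesaurus):
    """Return normalized names linked to more than one ID (candidate homonyms)."""
    return {name: sorted(ids) for name, ids in thesaurus.items() if len(ids) > 1}


def label_records(records, thesaurus):
    """Append the source to a name when it is a candidate homonym."""
    homonyms = find_homonyms(thesaurus)
    labeled = []
    for name, source in records:
        if normalize(name) in homonyms:
            labeled.append(f"{name} [{source}]")
        else:
            labeled.append(name)
    return labeled


# Invented training records: (accession name, placeholder identifier).
training = [
    ("Cultivar A", "ID-001"),
    ("cultivar a", "ID-001"),   # spelling variant, same identifier
    ("Cultivar B", "ID-002"),
    ("Cultivar-B", "ID-003"),   # same name, different identifier: homonym
]

thesaurus = build_thesaurus(training)
homonyms = find_homonyms(thesaurus)

# Invented records to label: (accession name, hypothetical source).
labeled = label_records(
    [("Cultivar A", "SourceX"), ("Cultivar B", "SourceY")], thesaurus
)
print(homonyms)
print(labeled)
```

In this sketch only the name tied to two identifiers receives a source suffix, which mirrors the fallback of appending the accession ID or source when the origin of the error cannot be determined; resolving a homonym outright would additionally require comparing SNP marker data, which is not shown here.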
Edge-Garza, D., Evans, K., Ross, E.M., Jung, S., Main, D. and Hardner, C. (2023). A strategy for aggregating multi-source historical phenotypic and genotypic data sets containing homonyms for global genomic prediction in apple (Malus domestica). Acta Hortic. 1362, 123-130
Keywords: globally unique identifier, passport identifier, data curation