Source: UNIVERSITY OF TENNESSEE submitted to NRP
STATISTICAL METHODS AND TOOLS FOR RESEARCH SCIENTISTS
Sponsoring Institution
National Institute of Food and Agriculture
Project Status
COMPLETE
Funding Source
Reporting Frequency
Annual
Accession No.
0221030
Grant No.
(N/A)
Cumulative Award Amt.
(N/A)
Proposal No.
(N/A)
Multistate No.
(N/A)
Project Start Date
Jan 1, 2010
Project End Date
Dec 31, 2014
Grant Year
(N/A)
Program Code
[(N/A)]- (N/A)
Recipient Organization
UNIVERSITY OF TENNESSEE
2621 MORGAN CIR
KNOXVILLE,TN 37996-4540
Performing Department
Animal Science
Non Technical Summary
This project proposes to develop training tools and statistical analysis software tools that will help researchers become more independent and make fewer mistakes. A secondary goal is to increase the capabilities of researchers by developing tools that provide access to new methods, either more advanced or more specialized. All tools will follow the "open source" philosophy, with code freely available to all. A website, dawg.utk.edu, will provide the training tools, consisting of tutorials that lead scientists through the typical statistical steps: choosing an appropriate experimental design, preparing the data for use with statistical software, running an appropriate statistical analysis complete with diagnostic checks of data and model validity, and interpreting the results. The DAWG website will also provide access to the statistical analysis software tools developed by this project. The software will provide an interface between the scientist and the complex SAS and R packages, minimizing both the input required from scientists and the chances of incorrect statistical analysis. For example, a typical analysis of variance in SAS might require 30 or 40 lines of code, with a complex set of options. The proposed software would require that the scientist write 2 statements, and the software would run the necessary 30 or 40 lines of code automatically, making correct choices and using the currently accepted optimal statistical methodology.
Animal Health Component
(N/A)
Research Effort Categories
Basic
(N/A)
Applied
(N/A)
Developmental
(N/A)
Classification

Knowledge Area (KA)   Subject of Investigation (SOI)   Field of Science (FOS)   Percent
901                   7310                             2090                     90%
901                   7310                             1080                     10%
Goals / Objectives
The overall goal is to make modern statistical methods available to researchers, packaged in a way that will reduce the software learning curve. Specific goals are:
1) Further develop the DANDA.SAS macro collection such that all standard statistical analyses in agriculture can be requested with minimal input by researchers.
2) Further develop the DAWG website resource to provide tutorials for all the capabilities of DANDA.SAS.
3) Create tools similar to those in objectives 1) and 2) based on R software, so researchers without access to SAS may still benefit from open source statistical analysis software.
4) Implement specialized statistical analyses with software that takes full advantage of today's computing hardware.
Project Methods
SAS code will be written in macro format such that a researcher can type in, for example, %mmaov(data,yvar,class=treat,fixed=treat), and obtain a complete SAS statistical analysis with all diagnostic checks on data and analysis automatically run. The dawg.utk.edu website will have tutorial pages developed explaining all steps for using the macros, from data preparation to SAS code to interpretation of output. Similar coding and tutorial developments will be initiated based on the R statistical software. These resources will be promoted by publishing statistical methods articles in society journals, and impact will be measured by monitoring website access and citations in the literature.
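The wrapper idea described above, a short request expanded into a longer script, can be sketched as follows. This is a minimal illustration in Python, not the DANDA.SAS source: the function name `build_anova_script`, its parameters, and the handful of SAS statements it emits are all assumptions chosen for the example, and a real analysis would include many more diagnostic steps.

```python
def build_anova_script(data, response, class_vars, fixed, random=None):
    """Expand a short analysis request into SAS statements (returned as text).

    Hypothetical stand-in for the macro-wrapper idea. It is NOT the
    DANDA.SAS implementation and covers only a small part of an analysis.
    """
    if not class_vars or not fixed:
        raise ValueError("need at least one class variable and one fixed effect")

    lines = [
        # Request a residual diagnostic panel along with the model fit.
        f"proc glimmix data={data} plots=residualpanel;",
        f"  class {' '.join(class_vars)};",
        f"  model {response} = {' '.join(fixed)};",
    ]
    if random:
        lines.append(f"  random {' '.join(random)};")
    for effect in fixed:
        # Pairwise comparisons of least-squares means for each fixed effect.
        lines.append(f"  lsmeans {effect} / pdiff adjust=tukey;")
    lines.append("run;")
    return "\n".join(lines)


# Mirrors the arguments of the macro call quoted in the text.
script = build_anova_script("data", "yvar", class_vars=["treat"], fixed=["treat"])
print(script)
```

The design point is that the researcher supplies only the dataset, response, and effects, while option choices are made once inside the wrapper.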

Progress 01/01/10 to 12/31/14

Outputs
Target Audience: The target audience is researchers using SAS software for statistical analysis. The focus is on designed experiments, so anyone in academia, business, or government generating data from an experimental design is in the target audience.
Changes/Problems: Nothing Reported
What opportunities for training and professional development has the project provided? Researchers can self-train in using SAS for statistical analysis using example datasets and instructions on the DAWG website.
How have the results been disseminated to communities of interest? The DAWG website is usually in the top 10 hits from Google searches involving SAS and experimental designs. We have been slow to advertise the website as we are still filling in details, even though the structure is complete.
What do you plan to do during the next reporting period to accomplish the goals? Nothing Reported

Impacts
What was accomplished under these goals? This project has accomplished the following for each of the numbered objectives:
1) danda.sas was fully developed during this project, with only updates and bug fixes needed in the final year. It has been made publicly available through the DAWG website and SourceForge, a website that hosts thousands of open source software projects.
2) The DAWG website offers a step-by-step guide to the use of danda.sas for analysis of a wide variety of experimental designs. The structure of the website is complete, and tabs and design module pages are posted. However, details such as sample datasets and example outputs are still being filled in, so this objective is about 80% complete. Work will continue even though the project is terminating.
3) Progress continues on creating similar tools for the R software, an increasingly popular free statistics, graphics and data management program, but this objective is only about 25% finished. Realistically, an entire project should be devoted to this effort.
4) Each year of the project had two or three large statistical analysis problems that needed substantial computing resources, e.g., weeks of computing time and 64+ gigabytes of memory. These were solved using SAS or R, but the solutions were specific to the research data, so general-purpose tools could not be produced.

Publications

  • Type: Journal Articles Status: Published Year Published: 2014 Citation: Bastin, B. C., A. Houser, C. P. Bagley, K. M. Ely, R. R. Payton, A. M. Saxton, F. N. Schrick, J. C. Waller, and C. J. Kojima. A polymorphism in XKR4 is significantly associated with serum prolactin concentrations in beef cows grazing endophyte-infected tall fescue. Animal Genetics (2014) 45(3): 439-441
  • Type: Websites Status: Published Year Published: 2014 Citation: http://dawg.utk.edu


Progress 10/01/13 to 09/30/14

Outputs
Target Audience: The target audience is researchers using SAS software for statistical analysis. The focus is on designed experiments, so anyone in academia, business, or government generating data from an experimental design is in the target audience.
Changes/Problems: Nothing Reported
What opportunities for training and professional development has the project provided? Researchers can self-train in using SAS for statistical analysis using example datasets and instructions on the DAWG website.
How have the results been disseminated to communities of interest? The DAWG website is usually in the top 10 hits from Google searches involving SAS and experimental designs. We have been slow to advertise the website as we are still filling in details, even though the structure is complete.
What do you plan to do during the next reporting period to accomplish the goals? The highest priority is finishing Objective 2, filling in training examples on the DAWG website.

Impacts
What was accomplished under these goals? This project has accomplished the following for each of the numbered objectives:
1) Danda.sas has been completed, with only minor enhancements and bug fixes added this year.
2) The DAWG website structure is complete, and tabs and design module pages are posted. However, details such as sample datasets and example outputs are still being filled in, so this objective is about 80% complete.
3) Progress continues on creating similar tools for the R software, an increasingly popular free statistics, graphics and data management program, but this objective is only about 25% finished. Realistically, an entire project should be devoted to this effort.
4) This objective covers specialized software solutions to very large computing problems associated with data analysis. As is typically the case, this year the challenges were gene expression data and QTL analyses. The publications list shows one example.

Publications

  • Type: Journal Articles Status: Published Year Published: 2014 Citation: Bastin, B. C., A. Houser, C. P. Bagley, K. M. Ely, R. R. Payton, A. M. Saxton, F. N. Schrick, J. C. Waller, and C. J. Kojima. A polymorphism in XKR4 is significantly associated with serum prolactin concentrations in beef cows grazing endophyte-infected tall fescue. Animal Genetics (2014) 45(3): 439-441
  • Type: Websites Status: Published Year Published: 2014 Citation: http://dawg.utk.edu


Progress 01/01/13 to 09/30/13

Outputs
Target Audience: Research scientists world-wide using SAS software to statistically analyze their data.
Changes/Problems: Nothing Reported
What opportunities for training and professional development has the project provided? Researchers can engage in self-training by working through the modules on the DAWG website.
How have the results been disseminated to communities of interest? The danda.sas file is easily found using a web search, and is freely downloadable. Similarly the DAWG website is openly accessible to anyone with an internet connection.
What do you plan to do during the next reporting period to accomplish the goals?
Goal 2) Since this will be the final year of the project, plans are to complete all modules within the DAWG website. We also have publications planned to make the resource citable, which may stimulate use.
Goal 3) We plan to have a working (but beta version) copy of a macro that will enable easy use of R. This will at least provide a skeleton that can be filled out by future efforts.

Impacts
What was accomplished under these goals?
Goal 1) This goal has been accomplished, but error correction and the addition of options requested by users have continued.
Goal 2) The DAWG website is approximately half finished, with most of the common experimental design modules now online.
Goal 3) Development of tools for the R software package similar to those in Goal 1 has continued, with orthogonal polynomial capability now functional. This allows researchers to generate contrasts for treatments that are amounts, such as fertilizer treatments of 0-150 lbs/acre.
Goal 4) A chestnut restoration project required repeated-measures analysis of variance on a large dataset. Preliminary runs found that the analysis took at least 6 days, which is impractical given the number of variables and the multiple reruns typically needed for repeated-measures models. An additional complexity was the presence of binary (not normally distributed) variables. Work-arounds were developed to obtain acceptably accurate results within the limits of feasibility.
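The orthogonal polynomial contrasts mentioned under Goal 3 can be illustrated with a short sketch. It is written in Python for illustration only and is not the project's R code. The four fertilizer levels (0, 50, 100, 150 lbs/acre) are an assumed example, since the report gives only the range, and the function name is hypothetical. The contrasts are built by Gram-Schmidt orthogonalization of powers of the centered levels, which also handles unequally spaced amounts.

```python
import math


def orthogonal_polynomial_contrasts(levels, max_degree=None):
    """Return unit-length linear, quadratic, ... contrasts for numeric levels.

    Hypothetical helper: orthogonalizes powers of the centered levels
    against the constant vector and all lower-degree contrasts.
    """
    n = len(levels)
    if len(set(levels)) != n:
        raise ValueError("treatment levels must be distinct")
    if max_degree is None:
        max_degree = n - 1
    if not 1 <= max_degree <= n - 1:
        raise ValueError("max_degree must be between 1 and len(levels) - 1")

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    center = sum(levels) / n
    centered = [x - center for x in levels]
    basis = [[1.0] * n]  # constant term, so every contrast sums to zero
    contrasts = []
    for degree in range(1, max_degree + 1):
        v = [c ** degree for c in centered]
        for b in basis:
            # Remove the component along each earlier basis vector.
            coef = dot(v, b) / dot(b, b)
            v = [vi - coef * bi for vi, bi in zip(v, b)]
        norm = math.sqrt(dot(v, v))
        v = [vi / norm for vi in v]
        basis.append(v)
        contrasts.append(v)
    return contrasts


# Assumed example levels in lbs/acre.
for name, row in zip(("linear", "quadratic", "cubic"),
                     orthogonal_polynomial_contrasts([0, 50, 100, 150])):
    print(name, [round(c, 4) for c in row])
```

For equally spaced levels the rows are proportional to the familiar textbook coefficients, for example (-3, -1, 1, 3) for the linear contrast with four levels.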

Publications

  • Type: Journal Articles Status: Published Year Published: 2013 Citation: Fallen BD, Hatcher CN, Allen FL, Kopsell DA, Saxton AM, Chen P, Kantartzi SK, Cregan PB, Hyten DL, and Pantalone VR. Soybean Seed Amino Acid Content QTL Detected Using the Universal Soy Linkage Panel 1.0 with 1,536 SNPs. Journal of Plant Genome Sciences (2013) 1(3): 68-79


Progress 01/01/12 to 12/31/12

Outputs
OUTPUTS:
Objective 1. Two versions of the danda.sas software have been fully tested, and are now released on the SourceForge website (https://sourceforge.net/projects/danda/). Version 2.11 is the last version that will run with SAS version 9.2 or earlier, and will only be updated to fix errors. Version 2.12 has been greatly modified to capitalize on SAS version 9.3 changes. The biggest revisions were to switch completely from Proc Mixed to Proc Glimmix for the mixed model analysis of variance computations, and to remove all use of SAS/GRAPH, relying instead on the new ODS graphics capabilities. The latter may help users reduce the number of products that must be licensed from SAS Institute.
Objective 2. The DAWG instructional website has been completely revised to mirror changes in SAS 9.3 output formatting. This was necessary because the previous formatting looks severely dated once users become accustomed to the SAS 9.3 html format.
Objective 3. R software continues to grow in popularity, so creating a danda.sas equivalent for this language has become a higher priority. A preliminary step towards this objective has been accomplished, with the translation of one of the danda.sas macros (pdmix) into an R function. This was used in the Ernest et al. 2012 publication cited below.
Objective 4. The largest compute application for this reporting period was the calculation of epistatic effects for soybean breeding. Preliminary runs using the EPISTACY SAS program written by JB Holland showed that computation would take approximately one year, clearly not feasible. Optimization of the SAS code produced an approximately 6-fold increase in speed, and other hardware tricks allowed the program to run in one month. This was for additive*additive interactions, but results suggested additive*additive*additive epistasis might be worth exploring. Given the exponential increase in time required (for example, 50,000 SNP genotypes would require 50,000 months), this was left as an open question.
PARTICIPANTS: Nothing significant to report during this reporting period.
TARGET AUDIENCES: Nothing significant to report during this reporting period.
PROJECT MODIFICATIONS: Nothing significant to report during this reporting period.
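The jump in workload from pairwise to three-way epistasis described under Objective 4 can be illustrated by counting marker combinations. The sketch below is a back-of-the-envelope aid, not part of the EPISTACY program. The function names are hypothetical, the 1,000-marker case is an assumed example, and the model assumes every interaction test costs the same amount of time, which the report does not state.

```python
from math import comb


def interaction_tests(n_markers, order):
    """Number of distinct marker combinations of a given interaction order."""
    return comb(n_markers, order)


def relative_cost(n_markers, low_order=2, high_order=3):
    """How many times more tests the higher-order scan needs.

    Assumes equal cost per test (an illustrative simplification).
    """
    return interaction_tests(n_markers, high_order) / interaction_tests(n_markers, low_order)


# 50,000 is the marker count quoted in the report; 1,000 is an assumed example.
for n in (1_000, 50_000):
    print(f"{n} markers: {interaction_tests(n, 2)} pairs, "
          f"{interaction_tests(n, 3)} triples, "
          f"ratio {relative_cost(n):.0f}")
```

Under this counting model the ratio of triples to pairs is (n - 2) / 3, so a scan that takes one month for pairs would take on the order of tens of thousands of months for triples at 50,000 markers, the same order of magnitude as the rough figure quoted above.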

Impacts
The frequency of questions and requests concerning the danda.sas program indicates that it continues to gain usage within the agricultural research community. SourceForge downloads last year were 48, a positive indicator given the new and unadvertised nature of that resource. Optimization of statistical software and computing hardware for the research data has produced answers for soybean breeding research that would otherwise have been impossible to obtain.

Publications

  • Ernest, B.; Gooding, J. R.; Campagna, S. R.; Saxton, A. M. & Voy, B. H. (2012) MetabR: an R script for linear model analysis of quantitative metabolomic data. BMC Research Notes 5: 596. http://dx.doi.org/10.1186/1756-0500-5-596


Progress 01/01/11 to 12/31/11

Outputs
OUTPUTS:
Objective 1. Two versions (2.11 and 2.12) of the DANDA.SAS software have been developed. The 2.12 version is being optimized for SAS 9.3, which has made major changes in default output formats. A SourceForge website (open-source project management site) has been created as a second outlet for the DANDA.SAS software. A book proposal has been sent to SAS Press which will provide examples and usage details for DANDA.SAS. As part of the book development, version 2.12 has been extensively changed to make usage more streamlined.
Objective 2. No progress.
Objective 3. No progress.
Objective 4. The only publication requiring substantial computing is listed, a microarray paper. However, RNAseq experimental data has been generated by one research group within the Institute of Agriculture, producing about 100,000,000 sequence reads. Preliminary analyses were impossible on standard desktop computers.
PARTICIPANTS: Nothing significant to report during this reporting period.
TARGET AUDIENCES: Nothing significant to report during this reporting period.
PROJECT MODIFICATIONS: Nothing significant to report during this reporting period.

Impacts
Objective 1. Interest in DANDA.SAS is growing, as indicated by email questions. It is anticipated that when (if) the book project is published, this Hatch project will be successful at putting useful statistical tools in the hands of the scientific community. Objective 4. Application of statistics and computing to research data is essential in the genomics area, and genomics methods are becoming more widely used in all agricultural disciplines. Research support or collaboration is critical.

Publications

  • Hill RD, Gouffon JS, Saxton AM and Su C. 2011. Differential gene expression in the mice infected with distinct Toxoplasma strains. Infection and Immunity 2012 80(3):968-74. Epub 2011 Dec 5.


Progress 01/01/10 to 12/31/10

Outputs
OUTPUTS:
Objective 1. A new version (1.30) of the DANDA.SAS software has been developed. After further testing this version will be posted on the DAWG website. This version has many new capabilities and bug fixes.
Objective 2. No progress.
Objective 3. A key program in DANDA.SAS has been translated for use in R.
Objective 4. As indicated in the publications, several high demand computing problems were addressed. The Wadl paper is a good example, in which about 3 weeks of computing time was needed to compute permutation tests for genetic markers. The Lutz paper required adaptation of a previously written SAS macro for generating contrast statements needed to conduct a diallel analysis.
PARTICIPANTS: Nothing significant to report during this reporting period.
TARGET AUDIENCES: Nothing significant to report during this reporting period.
PROJECT MODIFICATIONS: Nothing significant to report during this reporting period.
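The permutation tests mentioned under Objective 4 are computationally heavy because the test statistic is recomputed for every reshuffle of the data and for every marker. A minimal sketch of the idea for a single marker follows. It is a generic illustration, not the procedure used in the Wadl paper. The function name, the choice of a difference in class means as the statistic, and the toy data are all assumptions.

```python
import random
from statistics import mean


def permutation_p_value(trait, genotype, n_permutations=2000, seed=0):
    """Two-sided permutation p-value for a difference in trait means
    between two genotype classes at one marker (hypothetical helper)."""
    classes = sorted(set(genotype))
    if len(classes) != 2:
        raise ValueError("this sketch handles exactly two genotype classes")

    def mean_difference(values):
        first = [v for v, g in zip(values, genotype) if g == classes[0]]
        second = [v for v, g in zip(values, genotype) if g == classes[1]]
        return abs(mean(first) - mean(second))

    observed = mean_difference(trait)
    rng = random.Random(seed)  # fixed seed so reruns are reproducible
    shuffled = list(trait)
    hits = 0
    for _ in range(n_permutations):
        # Shuffling trait values breaks any real marker-trait association.
        rng.shuffle(shuffled)
        if mean_difference(shuffled) >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never zero.
    return (hits + 1) / (n_permutations + 1)


# Toy data, invented for illustration only.
toy_trait = [10, 11, 12, 13, 14, 15, 1, 2, 3, 4, 5, 6]
toy_genotype = ["AA"] * 6 + ["aa"] * 6
print(permutation_p_value(toy_trait, toy_genotype))
```

Running time grows with the number of permutations times the number of markers, which is how a genome-wide scan can reach weeks of computing.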

Impacts
Objective 1. DANDA.SAS version 1.30 does not require the IML package in SAS, which will allow greater access. Each package in SAS carries an additional yearly fee, so some scientists did not have access to IML. Objective 4. Application of current statistical software on computing hardware has enhanced the extraction of scientific meaning from data in several research projects.

Publications

  • Kim HY, Stewart TP, Wyatt BN, Siriwardhana N, Saxton AM, Kim, JH. (2010) Gene expression profiles of a mouse congenic strain carrying an obesity susceptibility QTL under obesigenic diets. Genes & Nutrition 5(3), 237-250. 10.1007/s12263-009-0163-0.
  • Lutz CG, Armas-Rosales AM, and Saxton AM. Genetic effects influencing salinity tolerance in six varieties of tilapia (Oreochromis) and their reciprocal crosses. Aquaculture Research 2010:1-11.
  • Stewart TP, Kim HY, Saxton AM and Kim JH. 2010. Genetic and genomic analysis of hyperlipidemia, obesity and diabetes using (TALLYHO/JngJ x C57BL/6J) F2 mice. BMC Genomics 11:713.
  • Wadl PA, Saxton AM, Wang X, Pantalone VR, Rinehart TA and Trigiano RN. 2011. Simple Sequence Repeat (SSR) Markers Associated with Red Foliage in Cornus florida L. Molecular Breeding 27:409-416 10.1007/s11032-011-9551-4.