Source: IOWA STATE UNIVERSITY submitted to NRP
IMPROVING, COMMUNICATING, AND APPLYING STATISTICAL METHODS FOR DESIGNING AND ANALYZING AGRICULTURAL, ECOLOGICAL, GENETIC, NUTRITIONAL, AND ENVIRONMENTAL STUDIES
Sponsoring Institution
National Institute of Food and Agriculture
Project Status
COMPLETE
Funding Source
Reporting Frequency
Annual
Accession No.
1010162
Grant No.
(N/A)
Cumulative Award Amt.
(N/A)
Proposal No.
(N/A)
Multistate No.
(N/A)
Project Start Date
Jul 18, 2016
Project End Date
Jun 30, 2021
Grant Year
(N/A)
Program Code
[(N/A)]- (N/A)
Recipient Organization
IOWA STATE UNIVERSITY
2229 Lincoln Way
AMES,IA 50011
Performing Department
Statistics
Non Technical Summary
Essentially all experiment station projects collect and analyze data. The questions being asked in these projects and the nature of data being collected are increasingly complex. Appropriate analysis and interpretation of these data require developing new statistical methods and better understanding of existing methods. Typical activities include using mathematics to derive a method or prove theoretical properties of a statistical method, developing computer code to fit a model, devising more efficient algorithms for large or difficult problems, and evaluating methods by numeric simulation. The specific mix of activities depends on the problem, the data and its characteristics, and the nature of the question to be answered with those data.
There are significant economic, societal, and environmental costs when statistical methods are used incorrectly or inefficiently. Research quality is enhanced and research costs are reduced when researchers are able to collaborate with well-trained statisticians.
Animal Health Component
40%
Research Effort Categories
Basic
20%
Applied
40%
Developmental
40%
Classification

Knowledge Area (KA)    Subject of Investigation (SOI)    Field of Science (FOS)    Percent
901                    7310                              2090                      100%
Goals / Objectives
Develop and implement more appropriate tools for the analysis of agricultural and biological data.
Develop new statistical methods to provide better research tools for researchers in agriculture and biology.
Evaluate the properties of new and current statistical methods when applied to complex problems and complex data.
Collaborate with and advise researchers on the appropriate use of new statistical methods.
Project Methods
In general, statistical research does a combination of figuring out how to do something, showing that it has desirable statistical properties, and providing a reasoned choice among alternative statistical methods. Typical activities include deriving an estimator, proving theoretical properties of a statistical method, developing computer code to fit a model, devising more efficient algorithms for large or difficult problems, and evaluating methods by numeric simulation. The specific mix of activities depends on the problem, the data and its characteristics, and the nature of the question to be answered with those data.
Many of the statistical research needs arise from collaborations between statistics faculty and other researchers. Statistical research is strengthened by strong connections to practical issues, such as the need to answer increasingly complex subject-matter questions and the difficulties created by large or complex data sets. Because of the widespread availability of sophisticated statistical methods and often limited understanding of how to choose the most appropriate of those methods for a particular problem, effective collaboration often requires statistical research into the properties and robustness of the possible choices.
Project evaluation metrics will include the number of papers on statistical methods published in the applied and statistical literatures and the number of active collaborations each year.

Progress 07/18/16 to 06/30/21

Outputs
Target Audience: Researchers in agriculture and biology and statisticians advising those researchers.
Changes/Problems: Nothing Reported
What opportunities for training and professional development has the project provided? Five statistics graduate students in the statistical consulting group have been trained in the science and practice of statistical consulting.
How have the results been disseminated to communities of interest? Individual discussions with researchers. Presentations at conferences. Publication of journal articles in statistical and non-statistical journals.
What do you plan to do during the next reporting period to accomplish the goals? Nothing Reported

Impacts
What was accomplished under these goals?
Impact statement: Essentially all experiment station projects collect and analyze data. The questions being asked in these projects and the nature of data being collected are increasingly complex. Appropriate analysis and interpretation of these data require developing new statistical methods and better understanding of existing methods. Typical activities include using mathematics to derive a method or prove theoretical properties of a statistical method, developing computer code to fit a model, devising more efficient algorithms for large or difficult problems, and evaluating methods by numeric simulation. The specific mix of activities depends on the problem, the data and its characteristics, and the nature of the question to be answered with those data. There are significant economic, societal, and environmental costs when statistical methods are used incorrectly or inefficiently. Research costs are reduced when researchers are able to collaborate with well-trained statisticians. Appropriate design of studies and well-informed interpretation of statistical results produce higher quality research that is more likely to be published in top scientific journals.
Objective 1: Develop new statistical methods to provide better research tools for researchers in agriculture and biology. Modern research generates data that often require new or refined statistical methods. Project researchers are frequently called upon to develop these methods. The previous annual reports have reported on over 10 examples of new statistical methods that solve problems faced by researchers. Here we give one additional example. Cluster analysis is widely used to identify groups of similar samples (plants, field plots, metabolites) based on multivariate features. 
For -omic data such as microbiome data and single-cell RNA-sequencing data, clustering is challenging because the data are high dimensional (e.g., thousands of features) and sparse (many features are 0 or very close to 0). Dr. Peng Liu and her group have been working on developing new model-based clustering algorithms for sparse -omic data. They have developed methods to cluster features and bi-clustering methods that simultaneously cluster both samples and features. Using bi-clustering, researchers can understand both the relationships among the features and the relationships among the samples. Liu et al. have devised computationally efficient algorithms that can handle large data sets and implemented them in a freely available R package. These new methods provide tools that extract patterns from large data sets, which can suggest hypotheses about the functions of -omic features.
Objective 2: Evaluate the properties of new and current statistical methods when applied to complex problems and complex data. A frequent question posed to project researchers is "which statistical method is appropriate" for a specific question, study design, and dataset. Answering this requires evaluating multiple possible choices of method. Previous annual reports have provided multiple examples of these evaluations. Here, we describe one additional example. Many statistical analyses require making data-based choices. Two examples are choosing a correlation model for repeated observations on the same subjects and choosing a detection probability model when estimating the number and survival of animals. Until recently, a single model was chosen and assumed to be the correct model. In practice, that single model is chosen using the data, which does not always lead to the correct model. The single model approach ignores the uncertainty arising from not knowing the correct model. 
Model averaging combines information from multiple models, which accounts for the uncertainty in the choice of model. We compared different ways to implement model averaging and found they could give quite different results. The most appropriate model averaging procedure is to explicitly choose the relative likelihood of different models. All the alternative methods correspond to implicit choices of that likelihood, and that implicit choice may not be appropriate. We have already begun communicating this work to applied scientists in agronomy and natural resources. It will enable more appropriate analyses of data arising from complicated studies.
Objective 3: Collaborate with and advise researchers on the appropriate use of new statistical methods. The project director and co-directors provided regular and frequent advice on the appropriate use of statistical methods. In many cases, these require advanced statistical methods that are not familiar to subject-matter researchers. We provide one recent example. Sonication is an experimental tool for improving protein yield from vegetarian sources such as wheat and soy. However, misuse of sonication technology can degrade the quality of the protein. Researchers in the Department of Food Science and Human Nutrition wanted to identify the optimal sonication level at multiple combinations of incubation time, dilution, slurry ratio, and material type (flour versus flakes). Dr. Somak Dutta designed and analyzed an elaborate factorial design to find conditions that maximize the amount of protein yield and its quality. During the first stage of experiments, sonication improved the protein yield by 10 to 15%. Process time could be reduced by around 15 minutes without significantly reducing protein yield. This reduces the energy required for processing. Soy flour required less dilution than did soy flakes. Further experiments are in progress to study the effect of these parameters on protein quality. 
These studies identify the most cost-effective combination of dilution and material type. Sonication is a method that reduces production costs and increases protein yield, which benefits producers and consumers.
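
The model-averaging comparison described under Objective 2 can be illustrated with a small sketch. It is not the project's procedure. It assumes information-criterion weights as a stand-in: weights proportional to an explicit prior model probability times exp(-BIC/2), compared with the commonly used AIC weights. The function names, log-likelihoods, parameter counts, and estimates are all hypothetical. Under these assumptions, AIC weights equal BIC-based weights with an implicit prior proportional to exp(k (ln n - 2) / 2), where k is the number of parameters and n the sample size, which is one concrete instance of an "implicit choice".

```python
import math

def normalize(ws):
    """Rescale non-negative weights so they sum to 1."""
    total = sum(ws)
    return [w / total for w in ws]

def model_weights(log_liks, n_params, n_obs, prior=None, criterion="bic"):
    """Weights proportional to prior_k * exp(-IC_k / 2) (illustrative helper)."""
    if prior is None:
        prior = [1.0] * len(log_liks)  # explicit choice: equal prior weight
    penalty = math.log(n_obs) if criterion == "bic" else 2.0
    ics = [-2.0 * ll + penalty * k for ll, k in zip(log_liks, n_params)]
    best = min(ics)  # subtract the minimum for numerical stability
    return normalize([p * math.exp(-0.5 * (ic - best))
                      for p, ic in zip(prior, ics)])

def averaged(estimates, weights):
    """Model-averaged point estimate."""
    return sum(w * e for w, e in zip(weights, estimates))

# Hypothetical fits of three candidate models to the same data set.
log_liks = [-120.4, -118.9, -118.7]
n_params = [2, 3, 5]
estimates = [0.62, 0.71, 0.74]
n_obs = 60

# Explicit equal prior model probabilities with BIC-based weights.
w_equal_prior = model_weights(log_liks, n_params, n_obs)

# AIC weights, a common default with no stated prior.
w_aic = model_weights(log_liks, n_params, n_obs, criterion="aic")

# The prior that AIC weights implicitly assume, relative to BIC-based weights.
implied_prior = [math.exp(k * (math.log(n_obs) - 2.0) / 2.0) for k in n_params]
w_implied = model_weights(log_liks, n_params, n_obs, prior=implied_prior)

print("equal-prior estimate:", averaged(estimates, w_equal_prior))
print("AIC-weight estimate: ", averaged(estimates, w_aic))
```

In this sketch `w_aic` and `w_implied` coincide, so the two averaged estimates differ only because of the prior, stated or not. Writing the prior down makes that choice visible and open to review.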

Publications

  • Type: Journal Articles Status: Published Year Published: 2020 Citation: Guo, X., Qiu, Y. Nettleton, D. Yeh, C-T Zheng, Z. Hey, S. Schnable, P.S. 2021. KAT4IA: K-Means Assisted Training for Image Analysis of Field-Grown Plant Phenotypes. Plant Phenomics
  • Type: Journal Articles Status: Published Year Published: 2021 Citation: Wang, R Dutta, S Roy, V 2021. A note on marginal correlation based screening. Statistical Analysis and Data Mining 14(1): 88-92
  • Type: Journal Articles Status: Published Year Published: 2021 Citation: Leonard, S.M. Xin, H. Brown-Brandl, T Ramirez, B.C. Johnson, A.K. Dutta, S. Rohrer, G.A. 2021. Effects of farrowing stall layout and number of heat lamps on sow and piglet behavior. Applied Animal Behavior Science 239, #105334
  • Type: Journal Articles Status: Published Year Published: 2021 Citation: Leonard, S.M. Xin, H. Ramirez, B.C. Stinn J.P. Johnson, A.K. Dutta, S. Liu K. Brown-Brandl, T 2021. Static and dynamic space usage of late stage gestation sows. Transactions of the ASABE 64(1): 151-159
  • Type: Journal Articles Status: Published Year Published: 2021 Citation: Rahman, M.M. Dutta, S. Lamsal, B.P. 2021. High-power sonication-assisted extraction of soy protein from defatted soy meals: Influence of important process parameters. Journal of Food Process Engineering 44(7): e13720
  • Type: Journal Articles Status: Published Year Published: 2021 Citation: Ward, J.L. Chou, Y-Y Yuan L. Dorman K.S. Mochel J.P. 2021 Retrospective evaluation of a dose-dependent effect of angiotensin-converting enzyme inhibitors on long-term outcome in dogs with cardiac disease. J. Vet. Internal Med. 35:2102-2111
  • Type: Journal Articles Status: Published Year Published: 2021 Citation: Velasquez-Zapata, V. Elmore J.M. Banerjee, S. Dorman, K.S. Wise, R.P. 2021. Next-generation yeast-two-hybrid analysis with Y2H-SCORES identifies novel interactors of the MLA immune receptor. PLoS Computational Biology 12, e1008890
  • Type: Journal Articles Status: Published Year Published: 2021 Citation: Martin-Schwarze, A. Niemi, J. Dixon, P. 2021. Joint modeling of distances and times in point-count surveys. J. Ag. Biol. Env. Statistics 26:289-305
  • Type: Journal Articles Status: Published Year Published: 2021 Citation: Stephenson, M. Schulte, L. Klaver, R. Niemi, J. iButton temperature dataloggers increase sample size and precision when estimating daily survival rate for bird nests. https://doi.org/10.1111/jofo.12389
  • Type: Journal Articles Status: Submitted Year Published: 2021 Citation: English, L. Niemi, J. Wilsey, B. Goode, K. Liebman, M. Understanding the variation in vegetation composition of prairie restorations within crop fields. submitted to Ecological Restoration
  • Type: Conference Papers and Presentations Status: Published Year Published: 2021 Citation: Dixon, P.M. Where do Bayesian methods fit in an applied statistician's toolbox. Invited presentation, Conference on Applied Statistics in Agriculture and Natural Resources.


Progress 10/01/19 to 09/30/20

Outputs
Target Audience: Researchers in agriculture and biology and statisticians advising researchers.
Changes/Problems: Dr. Sarah Nusser has joined the project upon her return to the Statistics Dept. after retiring from a full-time administrative position.
What opportunities for training and professional development has the project provided? Five statistics graduate students working in the statistical consulting group have been trained in the science and practice of statistical consulting.
How have the results been disseminated to communities of interest? Individual discussions with researchers. Publishing manuscripts in the applied and statistical literatures.
What do you plan to do during the next reporting period to accomplish the goals? Continue to respond to needs for new statistical methods. Publish current work in leading scientific and statistical journals.

Impacts
What was accomplished under these goals?
Overall impact statement: Collecting and analyzing data are important components of most experiment station projects. Often these require new statistical methods, evaluation of methods in novel settings, and sharing the use and interpretation of statistical methods. Some examples from this project period are described below. In all cases, our goal is to extract more useful information from data. Appropriate design of studies and well-informed interpretation of statistical results produce higher quality research that is more likely to be published in top scientific journals.
Objective 1: Develop new statistical methods to provide better research tools for researchers in agriculture and biology. Modern research generates data that often require new or refined statistical methods. Project researchers frequently develop these methods. Here we give four examples. Accurate genotyping of crop species is critical for breeding efforts, but next generation sequencing data often contain errors. Karin Dorman and her group developed a novel statistical method and accompanying software to identify true genetic variants in error-prone sequence data sampled from complex biological populations. The most common application is to microbiome studies that evaluate how a microbial population responds to perturbation. They show that their method is better able to distinguish true variants from error variants. Methods like theirs underlie studies on the role of the microbiome in human health, the effect of antibiotics on animal agriculture, and the effect of animal agriculture on the environment. Breeding for desirable plant phenotypes is faster when the genetic basis of those phenotypes is known. One important but poorly understood phenotype is development rate. Dan Nettleton's group is analyzing leaf appearance rates to identify the timing of distinct developmental phases for hundreds of maize genotypes. 
The statistical method employed is piecewise linear modeling of the number of leaves over time. Breakpoints where the slope changes indicate potential changes in developmental phase. This work helps researchers understand the effects of environment and genotype on plant development. The ultimate goal is to identify specific genes that control development to construct genotypes with superior developmental profiles. One serious issue with large observational data sets is that the relationship to the population of interest is unknown. This contrasts with well-designed surveys where that relationship is known. Data integration combines multiple data sets to make the best use of all available information. Jae-Kwang Kim has developed several data integration methods to combine survey sample data with large non-probability observational datasets. The survey sample is used to calibrate the observational data. Using data integration with data from the Australian Agricultural Census provides better estimates of regional crop production and water use. Data are frequently required to be publicly available so others can use them for new purposes. For example, USDA has a public access policy that requires principal investigators to share data. However, successfully using previously collected data requires more than the numbers. It requires understanding the purpose of the original study, its methods, and any quirks of the data. Sarah Nusser is researching what makes data reusable by another researcher. She is asking what kind of documentation or infrastructure is needed to understand the purpose of a study, its methods, the data and how the data were processed and analyzed? How can the burden of creating quality data products be reduced? What are the current barriers faced by researchers in sharing research data? Identifying best practices will help researchers implement good research practices. 
In addition, better prepared data sources will increase the visibility and impact of publicly accessible research data funded by the USDA.
Objective 2: Evaluate the properties of new and current statistical methods when applied to complex problems and complex data. A frequent question posed to project participants is "which statistical method is appropriate?" for a specific data set and question. This requires evaluating and comparing different statistical methods. In some cases, this requires evaluating different statistical models. In other cases, this requires evaluating alternative computational algorithms. We describe two examples. Many agricultural studies generate structured data such as multiple locations in each of 3 states that are combined to estimate an average response. One example is the estimation of below ground carbon sequestration by crops. When there are different numbers of locations in each state, the usual analyses require approximations. Different approximations produce quite different results. Philip Dixon is evaluating which is the most appropriate approximation and has found that this depends on the relative variability between studies. The work informs how to best analyze and report results from many different types of agronomic studies. Peng Liu is developing models to predict phenotypic traits in crops, especially sorghum. One difficulty is the large number of variables that potentially influence the phenotype. Models predict better when they use only the important variables. When the relationship between variables and the phenotype is complex, machine learning tools have been used to model that complexity. Peng Liu is evaluating how well different machine learning methods select the "important" features for predicting plant phenotypic traits.
Objective 3: Collaborate with and advise researchers on the appropriate use of new statistical methods. 
The project director and co-directors provide regular and frequent advice on the appropriate use of statistical methods. We provide three examples. Jarad Niemi collaborated with Mark Tomer, USDA, to analyze data from the agricultural conservation planning framework (ACPF) to determine placement for conservation practices including controlled drainage, contour buffer strips, water and sediment control basins, and grassed waterways in HUC12 watersheds. The landscape classifications could be used together to develop effective regional conservation strategies using precision planning tools. Alicia Carriquiry continued working with nutritionists around the world. In particular, she collaborated with Uruguayan nutritional scientists in the design and analysis of the first nationwide food consumption survey of school-age children in Uruguay. In addition to informing food assistance policy in Uruguay, the data contribute to a continent-wide study that is underway and managed by PAHO, the Pan American Health Organization. Another ongoing collaboration is with Prof. Zimmerman from ETH in Switzerland; the project collects information on iodine consumption in over 40 countries around the world to understand issues associated with iodine deficiencies and inform development of interventions. Somak Dutta collaborated with Hongwei Xin and colleagues to develop statistical models to analyze sow and piglet behavioral data in order to enhance sow and piglet production performance. These models provide equations that predict sow space usage as a function of current weight. The models suggest that the traditional crate size and one heat lamp per crate is optimal for sow and piglet production because larger crate sizes and an additional heat lamp cost more but do not improve production. An additional benefit to producers is a tool to efficiently design sow housing during gestation and farrowing. 
Dutta has also been working on statistical models to predict grain yields of novel crop varieties in future environments. The results improve the yield prediction accuracy by 10-40% and reduce the uncertainty in yield prediction by 10-15% depending on the environments.
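
The piecewise linear modeling of leaf number over time described under Objective 1 can be sketched as follows. This is a minimal illustration, not the project's implementation: it assumes a single breakpoint, a continuous "hinge" model (leaves = b0 + b1·day + b2·max(day − c, 0)), and a grid search over candidate breakpoints that picks the one with the smallest residual sum of squares. The function names and the noise-free leaf counts are hypothetical.

```python
def solve(a, b):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit_hinge(t, y, c):
    """Least-squares fit of y = b0 + b1*t + b2*max(t - c, 0) for a fixed breakpoint c."""
    X = [[1.0, ti, max(ti - c, 0.0)] for ti in t]
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(3)]
    beta = solve(xtx, xty)
    rss = sum((yi - sum(b * x for b, x in zip(beta, r))) ** 2
              for r, yi in zip(X, y))
    return beta, rss

def fit_breakpoint(t, y, candidates):
    """Return (breakpoint, coefficients, rss) for the candidate with the smallest rss."""
    return min(((c,) + fit_hinge(t, y, c) for c in candidates),
               key=lambda r: r[2])

# Hypothetical, noise-free leaf counts: the slope changes from 0.2 to 0.5 at day 10.
days = list(range(21))
leaves = [1.0 + 0.2 * d + 0.3 * max(d - 10, 0) for d in days]

# Candidates exclude the ends of the series, where the hinge term is not estimable.
best_c, beta, rss = fit_breakpoint(days, leaves, range(2, 19))
print("breakpoint:", best_c, "slope before:", beta[1], "slope change:", beta[2])
```

With real data the breakpoint estimate is uncertain, and a genotype may have more than one developmental phase change, so a fuller analysis would add further hinge terms and quantify uncertainty in each breakpoint.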

Publications

  • Type: Journal Articles Status: Published Year Published: 2020 Citation: Almodóvar-Rivera, I. A. and Maitra, R. (2020). Kernel-estimated nonparametric overlap-based syncytial clustering. Journal of Machine Learning Research, 21(122):1-54.
  • Type: Book Chapters Status: Awaiting Publication Year Published: 2021 Citation: Dutta, S. and Mondal, D. (2021). On the usefulness of lattice approximations for fractional Gaussian fields. Handbook of Statistics. Volume 44
  • Type: Journal Articles Status: Awaiting Publication Year Published: 2021 Citation: Fisher, K.E., Dixon, P.M., Han, G., Adelman, J.S., Bradbury, S.P. in press. Locating small animals using automated VHF radio telemetry with a multi-antennae array. Methods in Ecology and Evolution
  • Type: Journal Articles Status: Published Year Published: 2020 Citation: Leonard, S., Xin, H., Brown-Brandl, T., Ramirez, B., Dutta, S., and Rohrer, G. (2020) Effects of farrowing stall layout and number of heat lamps on sow and piglet production performance. Animals, 10 (2), 348.
  • Type: Journal Articles Status: Accepted Year Published: 2021 Citation: Leonard, S. M., Xin, H., Ramirez, B. C., Stinn, J. P., Dutta, S., Liu, K., and Brown-Brandl, T. M. in press. Static and dynamic space usage of late stage gestation sows. To appear in Transactions of the ASABE
  • Type: Journal Articles Status: Published Year Published: 2020 Citation: Mao, X., Dutta, S., Wong, R. K. W., Nettleton, D. (2020). Adjusting for spatial effects in genomic prediction. Journal of Agricultural, Biological, and Environmental Statistics. 25, 699-718
  • Type: Journal Articles Status: Awaiting Publication Year Published: 2021 Citation: Martin-Schwarze, A., Niemi, J., Dixon, P.M. in press, Joint modeling of distances and times in point-count surveys. Journal of Agricultural, Biological and Environmental Statistics
  • Type: Journal Articles Status: Accepted Year Published: 2021 Citation: Peng, X. and Dorman, K. S. in press AmpliCI: A High-resolution Model-Based Approach for Denoising Illumina Amplicon Data, Bioinformatics,
  • Type: Journal Articles Status: Accepted Year Published: 2020 Citation: Wang, R., Dutta, S., and Roy. V. (2020). A note on marginal correlation based screening. Statistical Analysis and Data Mining
  • Type: Journal Articles Status: Published Year Published: 2020 Citation: Yang, S. and Kim, J.K. (2020). Statistical Data Integration in Survey Sampling: A review, Japanese Journal of Statistics and Data Science 3, 625-650.
  • Type: Journal Articles Status: Published Year Published: 2020 Citation: Dean, A.N., Niemi, J.B., Tyndall, J.C., Hodgson, E.W., and O'Neal, M.E. (2020) Developing a decision-making framework for insect pest management: a case study using soybean aphid (Hemiptera: Aphididae). Pest Management Science (online)
  • Type: Journal Articles Status: Published Year Published: 2020 Citation: Lewis-Beck, C., Walker, V.A., Niemi, J., Caragea, P., and Hornbuckle, B.K. (2020) Extracting Agronomic Information from SMOS Vegetation Optical Depth in the US Corn Belt Using a Nonlinear Hierarchical Model. Remote Sensing. 12(5): 827.
  • Type: Journal Articles Status: Published Year Published: 2020 Citation: Tomer, M.D., Van Horn, J.D., Porter, S.A., James, D.E., and Niemi, J. (2020) Comparing agricultural conservation planning framework (ACPF) practice placements for runoff mitigation and controlled drainage among 32 watersheds representing Iowa landscapes. Journal of Soil and Water Conservation. 75(4): 460-271.


Progress 10/01/18 to 09/30/19

Outputs
Target Audience: Researchers in agriculture and biology and statisticians advising researchers.
Changes/Problems: Professor Ken Koehler retired in 2018. His contributions have been taken over by the other investigators.
What opportunities for training and professional development has the project provided? Six statistics graduate students working in the statistical consulting group have been trained in the science and practice of statistical consulting.
How have the results been disseminated to communities of interest? Individual discussions with researchers. Publication of journal articles.
What do you plan to do during the next reporting period to accomplish the goals? Continue to respond to needs for new statistical methods. Publish current work in leading scientific and statistical journals.

Impacts
What was accomplished under these goals?
Overall impact statement: Collecting and analyzing data are important components of most experiment station projects. Effective analysis of new forms of data requires new statistical methods, evaluation of current methods in novel settings, and communication about the use and interpretation of statistical methods. In the last year, we have developed new statistical methods for various applications and describe three of these below. In all cases, our goal is to extract more useful information from data. We have also evaluated statistical methods for novel and complex data. Finally, the project director and co-directors provided regular and frequent advice on the appropriate use of statistical methods. Appropriate design of studies and well-informed interpretation of statistical results produce higher quality research that is more likely to be published in top scientific journals.
Objective 1: Develop new statistical methods to provide better research tools for researchers in agriculture and biology. Modern research generates data that often require new or refined statistical methods. Project researchers are frequently called upon to develop these methods. Here we give three examples. Peng Liu is using her expertise in statistical genomics to develop better methods to analyze microbiome data, one of the most active research areas in agricultural and biological sciences. Detecting relevant microbiome features is challenging because microbiome data are high dimensional with many interactions. Liu and her students developed a method to estimate population effects of individual microbiome features for each combination of host, environment and microbe. When their approach was applied to an agricultural study of the rhizosphere microbiome of sorghum in response to nitrogen fertilizer application, the identified microbial taxa were consistent with biological understanding of potential plant-microbe interactions. 
These statistical analyses have helped plan future experimental studies. Ranjan Maitra specializes in statistical methods for clustering and classification. He has developed statistical methods to simultaneously categorize observations into groups and characterize each group. This has many applications in agronomy and animal science, including classifying wheat kernels, olive oils, and wines. He and his students have developed a fast machine learning method that categorizes observations while more accurately characterizing the individual groups. A second problem is to classify observations using matrix-valued data. Typical classification and clustering problems use vector-valued data, i.e., one value for each attribute measured on a sample. Matrix-valued data have multiple measurements for each sample. One example is classifying land cover types from multi-spectral satellite imagery; these are matrix-valued because there are many measurements for each location. A new statistical method reduced the error rate and provided more accurate land type classifications. A major challenge in plant breeding is understanding how new hybrids will perform in new environments. Not all new varieties can be evaluated by field experiments in all environments. Somak Dutta and others developed a novel hierarchical spatial model that accommodates missing combinations of variety and environment and can predict the yield of novel varieties in new environments. The model includes spatial adjustments to better estimate genotypic effects, genomic kinship information to predict yield for new varieties, and environmental covariate information to make predictions for new environments. The method provides better estimates of genotype-by-environment interactions and predictions of yield in new environments. The method is supplemented with publicly available open-source software. 
Objective 2: Evaluate the properties of new and current statistical methods when applied to complex problems and complex data. A frequent question posed to project researchers is "which statistical method is appropriate" for a specific data set and question. This requires evaluation of the various statistical methods that might be used. We describe one example. Philip Dixon was asked to help analyze data on invasiveness of horticultural species in Iowa and Minnesota. The overall concern is whether a plant grown in a garden "escapes" and becomes established in surrounding natural areas. This is a binary (yes/no) response. Current statistical methods apply when completely different species are in both data sets (unpaired data) or when all species are studied in both states (completely paired data). These data are partially paired; some species were studied in only one state, but some are in both states. When the data have only one yes/no value for each species and state, none of the traditional statistical methods provided good comparisons of invasiveness. Only a Bayesian method provided reliable inferences.
Objective 3: Collaborate with and advise researchers on the appropriate use of new statistical methods. The project director and co-directors provided regular and frequent advice on the appropriate use of statistical methods. In many cases, these require advanced statistical methods that are not familiar to subject-matter researchers. We provide four examples. Dan Nettleton is analyzing field trials of maize genotypes treated with different nitrogen levels. Stalk nitrate levels are measured to estimate nitrogen use efficiency in the various genotypes. Advanced statistical methods that adjust for differences in soil type and spatial trends across the experimental plots produce more precise estimates of nitrogen uptake. The goal of this work is to more quickly identify maize genotypes that use nitrogen efficiently and effectively. 
This will have economic benefits to farmers and improve water quality. Jarad Niemi is helping investigate the effect of dietary arginine supplementation in a commercial swine herd. Previous laboratory experiments showed that dietary arginine supplementation improved reproductive and offspring outcomes. However, when applied in a production setting, there was only a minor impact on reproductive performance and some moderate production benefits. Overall, arginine supplementation needs to be better understood before including it as a standard practice. Mark Kaiser used advanced statistical methods to relate the time of onset of Sudden Death Syndrome in soybeans to severity and yield loss at the end of the growing season. Both final severity and yield loss were recorded as proportions, which are more appropriately analyzed using a beta distribution than the usual normal distribution. Fitting these models to data from multiple fields showed that the slope coefficients in the beta regression models were not significantly different, even though the overall data patterns were quite dissimilar. Using the more appropriate beta response distribution fit the data from all fields and identified an important feature shared across all fields. Peng Liu provides expert advice on the design and analysis of experiments collecting "omic" data. Two groups at ISU are using a new technology, paired Ribo-seq and RNA-seq experiments, to study translational efficiency. A review of available methods found that they ignore a key feature of the experimental design. To address this, she and her student proposed a novel statistical method, developed algorithms to implement the method, and used it to estimate translational efficiency.

Publications

  • Type: Journal Articles Status: Accepted Year Published: 2019 Citation: Dai, F., Dutta, S., and Maitra, R. (2019). A Matrix-free Likelihood Method for Exploratory Factor Analysis of High-dimensional Gaussian Data. Tentatively accepted at Journal of Computational and Graphical Statistics subject to minor revisions. arXiv preprint arXiv:1907.11970.
  • Type: Journal Articles Status: Published Year Published: 2019 Citation: Berry, N. & Maitra, R. (2019). TIK-means: transformation-infused k-means clustering for skewed groups. Statistical Analysis and Data Mining: The ASA Data Science Journal, 12(3):223-233. doi: 10.1002/sam.11416
  • Type: Journal Articles Status: Accepted Year Published: 2019 Citation: Thompson, G. Z., Maitra, R., Meeker, W. Q. & Bastawros, A. (2019). Classification with the matrix-variate-t distribution. Journal of Computational and Graphical Statistics, Accepted. doi: 10.1080/10618600.2019.1696208.
  • Type: Journal Articles Status: Published Year Published: 2019 Citation: Hines, E.A., et al. 2019. The impact of dietary supplementation of arginine during gestation in a commercial swine herd: I. Gilt reproductive performance. Journal of Animal Science 97:3617-3625
  • Type: Journal Articles Status: Published Year Published: 2019 Citation: Hines, E.A. et al., 2019. The impact of dietary supplementation of arginine during gestation in a commercial swine herd: II. Offspring performance. Journal of Animal Science 97:3626-3635


Progress 10/01/17 to 09/30/18

Outputs
Target Audience: Researchers in agriculture and biology and statisticians advising researchers. Changes/Problems: Nothing Reported. What opportunities for training and professional development has the project provided? Five statistics graduate students working in the statistical consulting group have been trained in the science and practice of statistical consulting. How have the results been disseminated to communities of interest? Individual discussions with researchers. What do you plan to do during the next reporting period to accomplish the goals? Continue to respond to needs for new statistical methods. Publish current work in leading scientific and statistical journals.

Impacts
What was accomplished under these goals? Overall impact statement: Collecting and analyzing data are important components of most experiment station research. Effective analysis of new forms of data requires new statistical methods, evaluation of current methods in novel settings, and communication about the use and interpretation of statistical methods. In the last year, we have developed new statistical methods for various applications and describe five of these below. In all cases, our goal is to extract more useful information from data. We have also evaluated statistical methods for novel and complex data. Finally, the project director and co-directors provided regular and frequent advice to researchers on the appropriate use of statistical methods. Output examples: a tool for Monarch researchers to use to follow butterflies moving among different habitats in a field, methodologies to assist crop breeders to breed new and improved crops and also provide scientists with new tools to study plant genetics, and a drought susceptibility index which can help researchers better identify maize genotypes less sensitive to drought conditions. Appropriate design of studies and interpretation of statistical results produces higher quality research that is more likely to be published in top scientific journals. Objective 1. Develop new statistical methods to provide better research tools for researchers in agriculture and biology. Single Nucleotide Markers are critical for breeding projects. However, it is difficult to develop markers in allotetraploid species, which include many agriculturally important crops, because it is difficult to distinguish true SNPs from differences between the chromosomal copies. Karin Dorman and colleagues are developing a statistical method to genotype allotetraploid species such as peanut and cotton. This will provide an important tool to enhance breeding programs for these crops. 
Karin Dorman's group is also developing a statistical likelihood-based method to extract more and cleaner information from Next Generation Sequencing (NGS) datasets. Some of their datasets are so noisy that fewer than 5% of the reads make it through standard bioinformatics pipelines. Their method processes Illumina NGS reads to parse unique molecular identifiers, remove adapters, and merge paired-end reads. While there are many read preprocessing methods that accomplish several of these individual tasks, few of them take advantage of the known structure of custom-designed read targets. Dan Nettleton has been working with plant scientists at Iowa State to develop a drought sensitivity index based on maize growth curves under irrigated and non-irrigated growing conditions. Two irrigated and two non-irrigated fields were each planted with 100 genotypes in Nebraska during the summer of 2017. Each plot in each field was photographed daily using networked digital cameras installed in the fields. Images were processed with the help of online workers to obtain an estimate of plant height for each genotype in each field. A nonparametric growth curve estimate was generated for each combination of genotype and environment (irrigated vs. non-irrigated) using functional principal component methods. The area between the irrigated and non-irrigated estimated growth curves was calculated for each genotype. The larger the area, the greater the discrepancy between the growth curves and the greater the evidence that withholding irrigation has a negative impact on plant growth. Thus, these areas define our proposed drought susceptibility index. By accounting for differences in growth over a large portion of the growing season, we are able to detect sensitivities that may be invisible to researchers collecting data only at the end of a season. 
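The area-between-curves calculation behind the drought susceptibility index can be illustrated with a minimal sketch. It assumes the two growth curves for a genotype have already been estimated and evaluated on a common grid of days; the report's curves come from functional principal component methods, which are not reproduced here. The function names, the use of the absolute gap, and the trapezoid rule are illustrative assumptions, not the project's implementation.

```python
def trapezoid_area(days, values):
    """Area under a curve sampled at (days, values), by the trapezoid rule."""
    return sum(
        (d1 - d0) * (v0 + v1) / 2.0
        for d0, d1, v0, v1 in zip(days, days[1:], values, values[1:])
    )


def drought_susceptibility_index(days, irrigated, non_irrigated):
    """Area between irrigated and non-irrigated height curves (hypothetical helper).

    Assumption: the absolute gap is integrated, so the index is never negative.
    If the curves cross between grid points the result is only approximate.
    """
    if not (len(days) == len(irrigated) == len(non_irrigated)):
        raise ValueError("all inputs must have the same length")
    gaps = [abs(i - n) for i, n in zip(irrigated, non_irrigated)]
    return trapezoid_area(days, gaps)


# Made-up heights (cm) for one genotype, for illustration only.
days = [0, 10, 20, 30]
irrigated = [0, 40, 90, 150]
non_irrigated = [0, 35, 70, 110]
index = drought_susceptibility_index(days, irrigated, non_irrigated)
```

With these made-up numbers the gaps are 0, 5, 20, and 40 cm, so the index is 450 cm-days. A genotype whose two curves coincide would score 0.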
The ultimate goal of the work is to better identify maize genotypes less sensitive to drought conditions and to facilitate effective maize breeding programs that will lead to the creation of maize genotypes with stable performance in our changing climate. Genome-wide association studies have been popular in identifying genes involved with a variety of important phenotypic traits in plants. Genetic information on millions of markers, collected by modern sequencing technologies, is related to phenotypic information collected via large-scale agricultural field trials. The genetic effects of interest are frequently blurred by spatial heterogeneity in a field. Unless the spatial effects are adequately accounted for, estimates of genetic effects can be severely biased, which could lead to incorrect genomic selection. Somak Dutta has been developing statistical methodologies for adjusting for spatial effects and improving the accuracy of marker selection. A new penalized spatial mixed linear model combines adjustment for spatial heterogeneity and selection of genotypes in a single model. This synthesis enhances the accuracy of marker association by guarding against over-fitting of the spatial effects while still adequately adjusting for them. Keeping in mind the inherently large data size arising from multiple large-scale agricultural trials, Dutta has been developing a sophisticated computational framework that will allow fast statistical computations. These methodologies will assist crop breeders to breed new and better crops and also provide scientists with new tools to study plant genetics. Philip Dixon, in collaboration with three Monarch butterfly researchers, is developing statistical methods to estimate butterfly locations from radio telemetry data. The Monarch researchers attach a small radio transmitter to a butterfly, then measure the signal strength at 16 directional antennae placed at the corners of a field. The goal is to estimate the location of the butterfly. 
For practical reasons, these data are collected in a very different way from traditional radio telemetry data, which requires a new statistical analysis. Our method uses information about how signal strength depends on the angle and distance between butterfly and antenna. Preliminary analyses suggest we can estimate a butterfly's location to within 30 m in a 300 m x 300 m field. This work provides Monarch researchers with a tool to follow butterflies moving among different habitats in a field. Objective 2. Evaluate the properties of new and current statistical methods when applied to complex problems and complex data. A frequent question posed to project researchers is "which statistical method is appropriate" for a specific data set and question. This requires evaluation of the various statistical methods that might be used. One example... A genome-wide association study with data from an agricultural field trial commonly uses a two-stage method. In the first stage, a mixed linear model is employed to adjust for intra-block correlation, spatial correlation and other nuisance factors, and best linear unbiased predictors (BLUPs) of genotypic values are obtained. In the second stage, the BLUPs are used for detecting genome-wide association. Somak Dutta and his students are currently evaluating the performance of the two-stage method and Dutta's synthetic method using extensive computer experiments. The results of these experiments will provide a comprehensive comparison between the two approaches in terms of computational complexity and selection accuracy. Objective 3. Collaborate with and advise researchers on the appropriate use of new statistical methods. Two examples... Karin Dorman frequently is asked for advice about genomic data analysis. One example is advising on building a pipeline for processing reads of CRISPR/Cas9 barcodes, which are currently being developed to facilitate cell lineage tracing in developmental biology of plants and animals. 
Jarad Niemi is the data lead for the STRIPS project. This project evaluates the ecosystem value of incorporating prairie strip treatments into corn and soybean crops. He advises on all aspects of data management and statistical analysis for this very large project.

Publications


    Progress 10/01/16 to 09/30/17

    Outputs
    Target Audience: Researchers in agriculture and biology and statisticians advising researchers. Changes/Problems: Nothing Reported. What opportunities for training and professional development has the project provided? Statistics graduate students have been trained in the science and practice of statistical consulting. How have the results been disseminated to communities of interest? Individual discussions with researchers. What do you plan to do during the next reporting period to accomplish the goals? Continue to respond to needs for new statistical methods. Publish current work in leading scientific and statistical journals.

    Impacts
    What was accomplished under these goals? OVERALL IMPACT: Collecting and analyzing data are important components of most experiment station projects. Effective analysis of new forms of data requires new statistical methods, evaluation of current methods in novel settings, and communication about the use and interpretation of statistical methods. We developed new statistical methods for three applications, two in genomics and one in animal behavior. We are also evaluating statistical methods to be used for novel and complex data. In addition, the project director and co-directors provided regular and frequent advice on the appropriate use of statistical methods. Appropriate design of studies and interpretation of statistical results produces higher quality research that is more likely to be published in top scientific journals. Objective 1: Develop new statistical methods to provide better research tools for researchers in agriculture and biology. RNA-Sequencing (RNA-seq) technologies have been widely applied to study gene expression in all fields of biology. Identifying differentially expressed (DE) genes across treatments is one of the major steps in RNA-seq data analysis. Peng Liu and her student, Ran Bi, developed a flexible Bayesian mixture modeling procedure with a Dirichlet process for differential expression analysis between two conditions. This procedure can be widely applied to gene expression studies in plants, animals and other organisms. They also extended the method to a multiple-condition analysis. Specifically, they developed a semi-parametric Bayesian method for the detection of gene expression heterosis, i.e., genes whose expression levels in a hybrid offspring are higher (or lower) than in both of its inbred parents. Although the phenomenon of heterosis is widely applied in agriculture, the mechanism of heterosis is still unknown. 
Standard existing statistical approaches for RNA-seq analysis are not directly applicable for testing heterosis. Liu's effort helps to understand phenotypic heterosis at the molecular level. Dan Nettleton has been developing methods for the analysis of RNA-seq data in repeated-measures experiments. These experiments assess changes in RNA transcript levels within individuals over time. One example is the study of the molecular response to infection in plants or animals. A repeated-measures experiment reduces unwanted variability by repeatedly collecting blood from the same animal or leaf tissue from the same plant. By monitoring RNA transcript levels in these repeated samples, researchers can find genes that play key roles in successful or unsuccessful responses to infection. The statistical methods Nettleton is developing are used to determine which changes in transcript levels are simply the result of random variation and which changes suggest an important functional role for a gene or genes. Identifying such genes is a first step to understanding the molecular genetic response to infection and to developing animals and plants that can mount a successful defense against pathogens. Objective 2: Evaluate the properties of new and current statistical methods when applied to complex problems and complex data. More and more systems-level studies are undertaken to get an interactive picture of how genes work in cells. These studies integrate multiple types of data. For example, translation efficiency can be assessed by performing ribosome profiling experiments and RNA-seq experiments on the same set of samples. Inferring changes in translation efficiency across treatments or conditions requires a joint analysis of the two types of data. 
Peng Liu and her student are currently evaluating the existing methods for analysis of translation efficiency change, and trying to improve the currently available methods by developing new approaches using models more appropriate for such data sets. On-farm trials compare treatments, usually management practices, across many farmers' fields. On-farm trials differ from typical single-location experiments in a statistically important way. The variability between plots in an on-farm trial is likely to differ between farms, while it is usually similar for all plots in a typical single-location study. Fitting statistical models with farm-specific error variances is challenging and often unsuccessful. Philip Dixon is evaluating when it is appropriate to simplify the statistical model and assume a single error variance for all farms in an on-farm trial. The answer depends on the details of the design of the on-farm trial. When the same experimental design is used at all farms, the simplified statistical model provides correct inferences about treatment effects. It is much more important to worry about treatment-specific error variances, i.e. low variability between plots for some treatments and high variability for others. In this case, simplifying the statistical model overstates the precision of some treatment comparisons and understates the precision of others. Objective 3: Collaborate with and advise researchers on the appropriate use of new statistical methods. The project director and co-directors provided regular and frequent advice on the appropriate use of statistical methods. We provide four examples... Peng Liu has assisted Professor Stephen Howell of the Plant Sciences Institute to design a large RNA-seq experiment. She and her student, Ran Bi, also helped with the analysis of the dataset that was generated from this experiment. 
This effort led to the understanding of the transition from cell survival to cell death in response to persistent ER stress in maize seedlings. Peng Liu and her student, Emily Goren, are currently assisting Professor Julie Kuhlman and her student with the analysis of a single-cell RNA-seq (scRNA-seq) experiment. scRNA-seq technology has just been developed and the data are challenging to analyze. Many new statistical methods are quickly being developed, but it is unclear what methods should be recommended. Emily and Peng are evaluating new methods and advising collaborators on the appropriate tools for the data analysis. Jarad Niemi is working with the STRIPS team to analyze 10 years of data from an experiment to understand the impacts of planting prairie within row-crop agriculture fields. He analyzed all 50 or so response variables collected by 10 different researchers. These included crop yield, nitrate runoff, pollinator richness, and bird abundance. They found that putting 10% of row-crops into prairie reduced runoff and soil loss, tripled pollinators, and doubled bird abundance. This work sets the stage for an ongoing confirmatory study on a wider range of Iowa row-crop farmland. Ultimately, this work suggests a way to retain soil and nutrients on Iowa farmland, reduce or eliminate the hypoxic zone in the Gulf of Mexico, and restore native prairie in Iowa. Management of odors from large hog production facilities is a prominent issue in Iowa and other states with substantial hog industries. Working with a team of agricultural engineers, Ken Koehler helped design experiments to assess the efficacy of new methods and materials for mitigating odors from hog wastes.

    Publications


      Progress 07/18/16 to 09/30/16

      Outputs
      Target Audience: Researchers and statisticians advising researchers. Changes/Problems: Nothing Reported. What opportunities for training and professional development has the project provided? Nothing Reported. How have the results been disseminated to communities of interest? Verbally and as written manuscripts. What do you plan to do during the next reporting period to accomplish the goals? Continue to respond to needs for new statistical methods. Publish current work in leading scientific and statistical journals.

      Impacts
      What was accomplished under these goals? Collecting and analyzing data are important components of most experiment station projects. Effective analysis of new forms of data requires new statistical methods, evaluation of current methods in novel settings, and communication about the use and interpretation of statistical methods. Appropriate design of studies and interpretation of statistical results produces higher quality research that is more likely to be published in top scientific journals. Objective 1. Develop new statistical methods to provide better research tools for researchers in agriculture and biology. We described new statistical methods for four applications, two in genetics and two in ecology. ChIP-seq experiments evaluate interactions between DNA and proteins. Well-justified conclusions from these studies require biological replication, but there is no consensus on how to analyze replicated data. All current methods have limitations, including poorer detection of response peaks and incorrect multiple testing error control. Peng Liu and colleagues have developed the BinQuasi method that jointly models biological replicates using a generalized linear model framework and detects peaks by a one-sided quasi-likelihood ratio test. Her group has assembled an R package that implements these methods and is currently evaluating their performance in terms of peak detection and multiple testing error control. Next generation sequencing (NGS) data are now routinely collected in many biological laboratories. Standard methods of interpreting genetic sequences assume the sequencer output is exactly correct, but errors can occur. In some cases, for example, when identifying rare genetic variants, sequencing errors seriously limit the utility of the data. Karin Dorman and student Xin Yin developed a new statistical model for correcting errors in NGS data and provided software to implement the method within a bioinformatics pipeline. 
They demonstrate that most existing ad hoc computational methods for error correction approximately estimate their NGS error model, which provides statistical support for these methods. However, the approximations are often extremely crude, which explains the substantial improvement seen with the new method. Experimental soil warming is commonly used to evaluate potential impacts of altered climate. Relating observed ecological responses to the hourly or daily measurements of soil moisture and temperature is complicated by the high correlations within each time series. Petrutza Caragea has developed a new statistical method that compares groups of time series, such as soil moisture or soil temperature, and identifies similarities and differences across groups, i.e. between time series, and across time, i.e. within time series. Estimating abundance of an elusive species often involves capture-recapture surveys where individuals are tagged and their subsequent recaptures are recorded. These recapture events allow estimation of the probability of detection and therefore species abundance. Unfortunately, it is difficult to capture and tag most song birds, so most studies record individuals when they sing. Jarad Niemi and colleagues developed statistical methods to estimate avian abundance based on counts of singing birds where the time of observation is recorded. They found that an assumption commonly made by bird ecologists is untenable and suggest more appropriate assumptions to estimate song bird abundance. Objective 2. Evaluate the properties of new and current statistical methods when applied to complex problems and complex data. Random forest methodology is a useful statistical learning methodology for predicting outcomes (e.g., corn yield) from predictor variables (e.g., monthly rainfall totals, monthly high and low temperatures, soil type, etc.). 
Random forests are known to provide excellent point predictions in a variety of contexts, but accurate assessments of uncertainty are lacking. Dan Nettleton and colleagues are developing and evaluating confidence and prediction intervals that describe the uncertainty in predicted means and individual observations. They have also developed a new methodology, known as regression-enhanced random forests, that can improve on random forest predictions by taking into account linear structures in the data and capitalizing on trends over time that can result in values for the outcome variable of interest that fall outside the range of the training data. Clustering individuals into groups with similar characteristics has many applications. The design of many studies results in some known groupings of individuals; semi-supervised clustering methods determine whether there are additional groups beyond those already-known ones. Ranjan Maitra and colleagues are evaluating how well these methods work in two applications to complex data. One is using pathogen behavior to identify poorly known pathogens that behave similarly to other well-studied and well-known pathogens. The other is identifying genes that are differentially expressed in a diurnal cycle. Objective 3. Collaborate with and advise researchers on the appropriate use of new statistical methods. The project director and co-directors provide regular and frequent advice on the appropriate use of statistical methods. We provide two examples. Statisticians at ISU, led by Alicia Carriquiry, developed statistical methods for analyzing dietary intake data. Carriquiry has collaborated extensively on the use of these methods. She participated in a panel established by the National Institutes of Health and Health Canada to evaluate the challenges using Dietary Reference Intakes when the health endpoint is a chronic disease. 
Her collaborations with nutrition professionals in Mexico, the Philippines, Argentina, Colombia, Switzerland, China and Senegal have had international impact. Karin Dorman helped a team of researchers at ISU and in Mexico establish reference intervals for various blood measurements in water buffalo, an increasingly important agricultural species in the western hemisphere. These reference intervals provide upper and lower bounds for diagnostic test measurements, such as blood panels, in healthy individuals. Current recommendations are to use nonparametric statistical methods to compute the reference intervals, but quantifying their uncertainty is particularly tricky. Dorman obtained nonparametric estimates of the reference intervals and implemented modern bootstrapping methods to quantify their uncertainty.
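The idea of a nonparametric reference interval with bootstrap uncertainty can be sketched in a few lines. The report does not say which bootstrap variant or which percentiles were used, so this example substitutes the simplest choices as assumptions: a central 95% interval (2.5th and 97.5th percentiles) and a plain percentile bootstrap. All function names and the data are hypothetical.

```python
import random


def percentile(sorted_values, p):
    """Linear-interpolation percentile (p in [0, 100]) of a sorted list."""
    if not sorted_values:
        raise ValueError("need at least one value")
    rank = (len(sorted_values) - 1) * p / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = rank - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def reference_interval(values, lower=2.5, upper=97.5):
    """Nonparametric reference interval: empirical lower and upper percentiles."""
    s = sorted(values)
    return percentile(s, lower), percentile(s, upper)


def bootstrap_limits(values, n_boot=2000, conf=90.0, seed=0):
    """Percentile-bootstrap confidence intervals for each reference limit."""
    rng = random.Random(seed)
    lows, highs = [], []
    for _ in range(n_boot):
        # Resample the healthy animals with replacement, same sample size.
        resample = rng.choices(values, k=len(values))
        lo, hi = reference_interval(resample)
        lows.append(lo)
        highs.append(hi)
    lows.sort()
    highs.sort()
    tail = (100.0 - conf) / 2.0
    return (
        (percentile(lows, tail), percentile(lows, 100.0 - tail)),
        (percentile(highs, tail), percentile(highs, 100.0 - tail)),
    )


# Placeholder measurements 1..100 standing in for a blood analyte.
values = list(range(1, 101))
limits = reference_interval(values)
lower_ci, upper_ci = bootstrap_limits(values, n_boot=500)
```

With small samples the extreme percentiles are poorly determined, which is one reason the uncertainty of reference limits is hard to quantify and why more refined bootstrap methods than this one may be preferred in practice.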

      Publications