Integrated Genomic System

Bibliographic Details
Title: Integrated Genomic System
Document Number: 20110047189
Publication Date: February 24, 2011
Appl. No: 12/678196
Application Filed: September 30, 2008
Abstract: An integrated genomic system is a software system that facilitates data management and analysis connected with integrated genomic research, such as statistical genetics. Reference information, biological and experimental, describes context from which experiments are made. Reference information, including annotations for genes, markers, study, individuals, and so on, is input into the integrated genomic system for consolidation, accessibility, and linkage with other data to aid researchers to view influences and interactions in biological systems.
Inventors: Will, Hans-Martin (Redmond, WA, US); Anderson, Mark B. (Redmond, WA, US)
Assignees: Microsoft Corporation (Redmond, WA, US)
Claim: 1. A group of networked computers for viewing influences on interactions in biological systems selected from a group consisting of genetic background, infection stages, environmental states, life-style choices, and social structures, the group of networked computers comprising: a client application being executed on a client machine through which a user accesses a visual interface for viewing influences on interactions in biological systems; an application server being executed on a server machine for hosting applications and a job execution framework for off-loading jobs from the client application and automatically executing the jobs comprising the importation of biological data, statistical analyses, and the transformation of biological data; a compute cluster including job submission queues and cluster nodes being stored on a computer-executable medium, the cluster nodes including a head node, the head node being accessible by the server machine, the job submission queues being accessible by the job execution framework to place off-loaded jobs, input data of each job being transferred to the head node, the details of each job being transferred to a job submission queue of a cluster node of the compute cluster where the job is executed to produce biological analysis results; a relational database server storing reference information for genetic studies, participating study populations, and genetic markers that are under investigation, the relational database server being physically hosted on another server machine that is not the server machine hosting the application server; and a web-enabled collaborative document repository server, which is used to store and access two-dimensionally indexed data structures containing data matrices of genotype calls organized by study individual and genetic marker, the web-enabled collaborative document repository server being physically hosted on another server machine that is not the server machine hosting the application server.
Claim: 2. The group of networked computers of claim 1, wherein the applications include an application for providing a repository of biological reference information including genes and genetic markers.
Claim: 3. The group of networked computers of claim 1, wherein the applications include an application for providing a repository that captures genetic studies across research projects including the study design underlying a scientific experiment, groups of individuals participating in a study, assay technology used to determine genetic variation, technology-specific information pertaining to genetic markers being targeted in the scientific experiment.
Claim: 4. The group of networked computers of claim 1, wherein the applications include an application for providing a repository for assay results, each data point in an assay result being linked back to a piece of biological reference information and a piece of study design information.
Claim: 5. The group of networked computers of claim 1, wherein the applications include an application for implementing quality control procedures to exclude unreliable or questionable data points from analysis.
Claim: 6. The group of networked computers of claim 1, wherein the applications include an application that transforms a set of genetic variation measurements into exportable data to a set of analysis tools external to the group of networked computers.
Claim: 7. The group of networked computers of claim 1, wherein the applications include an application that captures parameters and input values for each analysis result or intermediate steps of data processing to create audit trails from each data point in each analysis result back to a boundary separating the group of networked computers from other computing machinery external to the group of networked computers.
Claim: 8. In execution on a group of networked computers, a computer-readable medium having computer-executable instructions stored thereon for implementing a method for analyzing interactions in biological systems, the method comprising: creating a study to capture a population of individuals being genotyped to calculate statistical results about a specific assay used to measure genetic variations for a set of markers; loading and copying of external genotype data files into data load data sets; creating a study data set to associate a genotype call to each individual and marker that are associated with the study by reconciling genotype calls for samples across one or more data load data sets; and creating an analysis data set to focus on a subset of the study data set by restricting the data shown to data points associated with a given individual list and marker list, the analysis data set being a two-dimensional organization of genotype information associated with a set of individuals and markers without using a copy of genotyping data.
Claim: 9. The computer-readable medium of claim 8, wherein creating a study includes specifying a unique study identifier, species information of organisms under investigation, and the specific genome assembly to be used for analysis, creating a study further including creating an individual panel representing individuals who are participating in the study, each individual being marked with a unique identifier and phenotypic information being extracted from the individual so as to classify the individual into sub-populations, creating a study yet further including selecting one or more marker panels for use in the study, each marker panel determining a kind of genotyping assay results that constitute valid data load data sets within the study.
Claim: 10. The computer-readable medium of claim 9, wherein loading genotype data includes loading a sample manifest into system memory for identifying samples present in a genotype data matrix, loading the genotype data determining marker panel associated with the genotype data matrix and determining dimensions of the genotype data matrix that needs to be created for loading the genotype data.
Claim: 11. The computer-readable medium of claim 10, wherein copying genotype data includes creating a first HDF5 file connected with the study data set so that dimensions of data matrices in the first HDF5 file have rows equal to the number of markers and columns equal to the number of individuals, copying genotype data further includes creating a second HDF5 file connected with the data load data set so that dimensions of data matrices in the second HDF5 file have rows equal to the number of markers and columns equal to the number of samples, each matrix being allocated using a block structure that partitions the matrix into blocks of data, each block being associated with a window defined by a range of marker identifications and a range of sample identifications, the window being associated with a queue, copying genotype data including copying data from an external genotype data file into blocks of data by comparing sample identifications and marker identifications of the external genotype data file with the identifier ranges for each window.
Claim: 12. The computer-readable medium of claim 11, wherein creating a study data set includes selecting one or more data load data sets to be combined into the study data set, creating a study data set further comprising creating a second HDF5 data file that contains a stack of two-dimensional matrices to organize genotyping information for a set of individuals and markers, the set of individuals being a union of all individuals represented in the data load data sets, the set of markers being a union of all markers in the data load data sets.
Claim: 13. The computer-readable medium of claim 12, wherein creating an analysis data set includes defining a two-dimensional organization of genotype information associated with a subset of individuals and markers extracted from the study data set without containing its own copy of genotyping data.
Claim: 14. A method for analyzing interactions in biological systems, the method comprising: creating a study to capture a population of individuals being genotyped to calculate statistical results about a specific assay used to measure genetic variations for a set of markers; loading and copying of external genotype data files into data load data sets; creating a study data set to associate a genotype call to each individual and marker that are associated with the study by reconciling genotype calls for samples across one or more data load data sets; and creating an analysis data set to focus on a subset of the study data set by restricting the data shown to data points associated with a given individual list and marker list, the analysis data set being a two-dimensional organization of genotype information associated with a set of individuals and markers without using a copy of genotyping data.
Claim: 15. The method of claim 14, wherein creating a study includes specifying a unique study identifier, species information of organisms under investigation, and the specific genome assembly to be used for analysis, creating a study further including creating an individual panel representing individuals who are participating in the study, each individual being marked with a unique identifier and phenotypic information being extracted from the individual so as to classify the individual into sub-populations, creating a study yet further including selecting one or more marker panels for use in the study, each marker panel determining a kind of genotyping assay results that constitute valid data load data sets within the study.
Claim: 16. The method of claim 15, wherein loading genotype data includes loading a sample manifest into system memory for identifying samples present in a genotype data matrix, loading the genotype data determining marker panel associated with the genotype data matrix and determining dimensions of the genotype data matrix that needs to be created for loading the genotype data.
Claim: 17. The method of claim 16, wherein copying genotype data includes creating a first HDF5 file connected with the study data set so that dimensions of data matrices in the first HDF5 file have rows equal to the number of markers and columns equal to the number of individuals, copying genotype data further includes creating a second HDF5 file connected with the data load data set so that dimensions of data matrices in the second HDF5 file have rows equal to the number of markers and columns equal to the number of samples, each matrix being allocated using a block structure that partitions the matrix into blocks of data, each block being associated with a window defined by a range of marker identifications and a range of sample identifications, the window being associated with a queue, copying genotype data including copying data from an external genotype data file into blocks of data by comparing sample identifications and marker identifications of the external genotype data file with the identifier ranges for each window.
Claim: 18. The method of claim 17, wherein creating a study data set includes selecting one or more data load data sets to be combined into the study data set, creating a study data set further comprising creating a second HDF5 data file that contains a stack of two-dimensional matrices to organize genotyping information for a set of individuals and markers, the set of individuals being a union of all individuals represented in the data load data sets, the set of markers being a union of all markers in the data load data sets.
Claim: 19. The method of claim 18, wherein creating an analysis data set includes defining a two-dimensional organization of genotype information associated with a subset of individuals and markers extracted from the study data set without containing its own copy of genotyping data.
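Illustrative note: Claims 14, 18, and 19 describe a study data set whose individuals and markers are the union of those in the data load data sets, and an analysis data set that restricts the study data set to a given marker list and individual list without holding its own copy of the genotype data. The following is a minimal Python sketch of that idea, not the implementation described in the application. It uses in-memory lists instead of HDF5 files, keys calls directly by individual rather than by sample, and reconciles conflicts with a "later load wins" rule. The names `StudyDataSet`, `AnalysisDataSet`, `build_study_data_set`, and `NO_CALL` are hypothetical and do not come from the claims.

```python
from dataclasses import dataclass

NO_CALL = "--"  # hypothetical placeholder for a missing genotype call


@dataclass
class StudyDataSet:
    markers: list      # row labels (marker identifiers)
    individuals: list  # column labels (individual identifiers)
    matrix: list       # matrix[row][col] holds one genotype call
    row: dict          # marker identifier -> row index
    col: dict          # individual identifier -> column index


def build_study_data_set(data_loads):
    """Combine data load data sets into one study data set.

    Each data load is a dict mapping (marker_id, individual_id) to a call.
    Rows are the union of all markers, columns the union of all individuals.
    Assumed reconciliation rule: a later load overrides an earlier one.
    """
    markers = sorted({m for load in data_loads for (m, _) in load})
    individuals = sorted({i for load in data_loads for (_, i) in load})
    row = {m: r for r, m in enumerate(markers)}
    col = {i: c for c, i in enumerate(individuals)}
    matrix = [[NO_CALL] * len(individuals) for _ in markers]
    for load in data_loads:
        for (m, i), call in load.items():
            matrix[row[m]][col[i]] = call
    return StudyDataSet(markers, individuals, matrix, row, col)


@dataclass
class AnalysisDataSet:
    """A view over a study data set; it stores only the two lists."""
    study: StudyDataSet
    marker_list: list
    individual_list: list

    def call(self, marker, individual):
        # Reject points outside the restricted lists.
        if marker not in self.marker_list or individual not in self.individual_list:
            raise KeyError((marker, individual))
        s = self.study
        # Read through to the study matrix; nothing is copied.
        return s.matrix[s.row[marker]][s.col[individual]]

    def rows(self):
        # Yield (marker, calls) pairs in marker-list order.
        for m in self.marker_list:
            yield m, [self.call(m, i) for i in self.individual_list]


# Example with two small, made-up data loads.
load_a = {("rs1", "ind1"): "AA", ("rs2", "ind1"): "AG"}
load_b = {("rs2", "ind2"): "GG", ("rs3", "ind1"): "CT"}
study = build_study_data_set([load_a, load_b])
analysis = AnalysisDataSet(study, ["rs2", "rs3"], ["ind1"])
```

Because `AnalysisDataSet` keeps only a reference to the study data set plus the two lists, a change to the study matrix is visible through the analysis data set immediately, which is the practical consequence of not copying the genotype data.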
Current U.S. Class: 707/803
Current International Class: 06; 06
Accession Number: edspap.20110047189
Database: USPTO Patent Applications