Bioinformatics Core Service

Data that comprises the bioinformatics core service

Additional Info

Field Value
Access Contact: bioinformatics@umich.edu
Access mechanism: omitted
Access protocol: All data sets are organized into separate project folders that are shared across all personnel in the Core. That convention has some flexibility for more secure needs. Very occasionally we need to do some analysis on a local workstation (e.g., a visualization). Project folders are internal to the Core; if we need to exchange data with the PI, we set up a separate collaboration space and move data in and out of it in a dedicated fashion. The collaboration space is a shared directory between the Core and the research project personnel, and is therefore somewhat more private.
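
As a purely illustrative sketch of that convention (not the Core's actual tooling), provisioning the two kinds of directories might look like the following; the paths, permission bits, and project identifier are assumptions for illustration only.

import os
import stat
from pathlib import Path

# Hypothetical roots; the real paths and group setup are assumptions for illustration.
CORE_ROOT = Path("/data/bioinf-core/projects")          # internal, shared across Core staff
COLLAB_ROOT = Path("/data/bioinf-core/collaborations")  # shared with a specific PI's personnel

def make_project_folder(project_id: str) -> Path:
    """Create a project folder readable and writable by everyone in the Core's group."""
    path = CORE_ROOT / project_id
    path.mkdir(parents=True, exist_ok=True)
    os.chmod(path, stat.S_IRWXU | stat.S_IRWXG)  # owner + group access; none for others
    return path

def make_collab_space(project_id: str) -> Path:
    """Create a more private collaboration space, shared only with the PI's project group."""
    path = COLLAB_ROOT / project_id
    path.mkdir(parents=True, exist_ok=True)
    os.chmod(path, stat.S_IRWXU)  # owner only; the PI's group would be granted access separately
    return path

if __name__ == "__main__":
    make_project_folder("PRJ-example")
    make_collab_space("PRJ-example")
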
Attribution Citation: N/A
Business processes involved: various research processes
Collection mechanism(s): Data comes in through the sequencing core or the PI; we analyze and freeze the output, and then keep it for some length of time. Annotations are currently downloaded on an ad hoc basis; we are looking to move those onto a scheduled basis.
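
A minimal sketch of what a scheduled annotation refresh could look like, assuming it would be invoked from a scheduler such as cron; the download URL and target directory below are placeholders, not the Core's actual sources.

import datetime
import urllib.request
from pathlib import Path

ANNOTATION_URL = "https://example.org/reference/annotations.gtf.gz"  # hypothetical source
ANNOTATION_ROOT = Path("/data/bioinf-core/reference/annotations")    # hypothetical target

def refresh_annotations() -> Path:
    """Download the current annotation release into a dated snapshot directory.

    Running this on a schedule (rather than ad hoc) lets each analysis record
    exactly which annotation snapshot it used.
    """
    snapshot_dir = ANNOTATION_ROOT / datetime.date.today().isoformat()
    snapshot_dir.mkdir(parents=True, exist_ok=True)
    target = snapshot_dir / ANNOTATION_URL.rsplit("/", 1)[-1]
    urllib.request.urlretrieve(ANNOTATION_URL, target)
    return target

if __name__ == "__main__":
    print(refresh_annotations())
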
Content Exposure: Up until now we have not been capturing data in a longitudinal database; it is left to the PIs to figure out what to share and what not to share. The PIs wholly determine what to make available as appropriate; we would not do any of this for them. Down the road, with the possibility of secondary use, we want to figure out how to pose those questions to the PIs and how to make the case for secondary use. We could make that case by offering to host the physical infrastructure for the data, both for the PI and for their NIH- (or other agency-) mandated publication purposes; that would be a win-win.
Data Manager(s): none at this time
Data Provided By: BRCF
Data Steward(s): none at this time
Data collection format(s): almost entirely flat files, plus 1-2 databases (annotation sources that we import)
Data element definitions: Elements include annotations, the axes of experiments, and all related detail; it would be good to have a Data Dictionary set up for this information. There is also metadata around how the sequencing machine was set up to run. Experiments may be run on a whole-genome, exome, or panel basis.
Data lineage / mapping from other sources: Three feeds: the sequencing core (typically the DNA Core); whatever sample metadata we get from the PI; and reference data (genome sequence and annotation data). We could get more lineage on the reference data, but we generally don't.
Data profile:
General guidelines for ability to access: omitted
Grain of data collection: some combination of Gene, Locus (smaller), and Sample (larger)
Higher-level data models (logical and/or conceptual): Derived data sets. The DNA sequencing core deals with primary data; we deal with secondary analyses (grooming and cleansing, which are consistent) and tertiary analyses (sense-making, which is widely variable).
How is this data used? We combine DNA sequence with reference data to identify anomalies and annotate their impact. "DNA" is shorthand for five different types of experiments.
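
Purely as an illustration of that pattern (sequence-derived variants joined against reference annotations), and not the Core's actual pipeline: the file names, columns, and gene-keyed lookup below are assumptions.

import csv
from pathlib import Path

def load_gene_annotations(path: Path) -> dict[str, str]:
    """Read a hypothetical tab-separated annotation table mapping gene -> functional note."""
    with path.open(newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return {row["gene"]: row["annotation"] for row in reader}

def annotate_variants(variants_path: Path, annotations: dict[str, str]) -> list[dict]:
    """Attach a reference annotation to each detected anomaly (variant), keyed by gene."""
    annotated = []
    with variants_path.open(newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            row["impact_note"] = annotations.get(row["gene"], "no annotation found")
            annotated.append(row)
    return annotated

if __name__ == "__main__":
    notes = load_gene_annotations(Path("reference_annotations.tsv"))        # hypothetical file
    for variant in annotate_variants(Path("sample_variants.tsv"), notes):   # hypothetical file
        print(variant["gene"], variant["impact_note"])
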
Initial Creation Date: on a per-service basis
Initial use cases/motivations for data collection: We create derived artifacts that assist researchers in understanding the DNA sequencing information for their biological samples.
Last Modified Date: on a per-service basis
Physical data infrastructure: omitted
Physical data model, reverse-engineered: almost entirely in the file system; the data sets themselves are fairly siloed files, so we can change structure easily, but there is no longitudinal analysis
Primary Users / Customers: Research PIs
Regulatory and other classifications: At this point genomic information is not classified as PHI, and we try not to capture any PHI.
Retention schedule: We have a tentative data lifecycle model that would give guidance to the stakeholders, but at this point it is just an idea. We have the infrastructure to store data for about 1-2 years, but we don't really want to be in the business of being the long-term archive for the investigators. One of our core ideas is to keep data for roughly a year after we finish the analysis, after which we would reach out to the investigators and tell them that if they want it they should take it now, since after a certain time we will either archive or delete it. Certain investigators require some physical infrastructure, i.e., a pipeline that allows the data to be accessed and leveraged for some time after we have actually done our work with it; this is not part of our core capability but is important to note nonetheless, and happens roughly 5% of the time.
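
A minimal sketch of how such a retention sweep might be expressed, assuming project folders live under a single root and a folder's last-modified time approximates the analysis freeze date; the root path and the one-year cutoff below are placeholders mirroring the tentative model above.

import datetime
from pathlib import Path

PROJECT_ROOT = Path("/data/bioinf-core/projects")  # hypothetical root
RETENTION = datetime.timedelta(days=365)           # tentative one-year window

def projects_past_retention(root: Path = PROJECT_ROOT,
                            retention: datetime.timedelta = RETENTION) -> list[Path]:
    """List project folders last modified before the retention cutoff.

    These are candidates for notifying the investigator before archiving or deletion.
    """
    cutoff = datetime.datetime.now() - retention
    stale = []
    for project in root.iterdir():
        if not project.is_dir():
            continue
        modified = datetime.datetime.fromtimestamp(project.stat().st_mtime)
        if modified < cutoff:
            stale.append(project)
    return stale

if __name__ == "__main__":
    for project in projects_past_retention():
        print(f"Past retention window: {project}")
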
Roles / persons involved in collection: BRCF
Schedule for known changes to Access Conditions: none
Standards utilized: none
Storage Format: text, binary
Subject Matter Expert(s): the team of bioinformatics analysts
Time Frame Data Collected For: on a per-service basis
Update Schedule: on a per-service basis