8/12/2019 Kristen 1scsasc
Bioinformatics Databases: Fundamental Concepts of Database Technology & Data Organization
Kristen Anton
Director of BioInformatics
Dartmouth Medical School
BioInformatics @ Dartmouth Medical School

How can data be organized?
- Paper (i.e. in notebooks)
- Flat files
  - Collection of data records
  - Minimal structure, no metadata
  - Application program must contain relationship information
- Database
  - Hierarchical
  - Network
  - Relational
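
To make the flat-file case concrete, here is a minimal Python sketch. The file contents, the column meanings and the helper `samples_for_patient` are hypothetical and not taken from the slides. The file carries no metadata, so the program itself has to know what each column means and how records relate.

```python
import csv
import io

# Hypothetical flat file: no header and no metadata. Only the program
# knows that column 0 is a sample ID, column 1 a patient ID and
# column 2 a tissue type.
FLAT_FILE = """S001,P01,tumor
S002,P01,normal
S003,P02,tumor
"""

def samples_for_patient(text, patient_id):
    """Return sample IDs for one patient; column positions are hard-coded."""
    rows = csv.reader(io.StringIO(text))
    return [row[0] for row in rows if row[1] == patient_id]

print(samples_for_patient(FLAT_FILE, "P01"))
```

If the column order in the file ever changes, every program that reads it has to change too, which is the weakness the slide points at.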

What is a relational database?
A database composed of relations and conforming to a set of principles governing how such relations are supposed to behave (Codd's 12 Rules).
There are many database systems that use tables but don't conform to all of the principles. These are often called semirelational systems.
from Understanding SQL, Martin Gruber

Practically speaking...
- A database is a body of information stored in two dimensions (rows and columns)
  - Rows are records
  - Columns are attributes of those record entities (usually!)
- The groups of rows and columns, or tables, are largely independent of each other
- The power of the database lies in the relationships that you construct among the tables
- A database is self-describing: it contains metadata, which is a description of its own structure
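
The self-describing property can be seen by asking a database about its own structure. The sketch below uses Python's built-in `sqlite3` module as a stand-in for any relational system; the `gene` table and its columns are hypothetical, and the catalog table `sqlite_master` and `PRAGMA table_info` are specific to SQLite (other systems expose their metadata differently).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, symbol TEXT, chromosome TEXT)"
)

# The catalog table sqlite_master lists the objects the database contains.
tables = [name for (name,) in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, default value, pk).
columns = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(gene)")]

print(tables)
print(columns)
```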

What is a Database Management System (DBMS)?
- A set of programs which define, administer and process databases and their associated applications
- A scalable DBMS can run on multiple platforms (varying sizes)
- A DBMS that supports interoperability uses industry-standard language and standard ways of exchanging data
- Examples: Oracle, Sybase, 4D, MS Access

Features of a Relational Database
- Rows (records) are in no particular order
- Columns (fields) are ordered, numbered and named; names should indicate the content of the field
- Primary key uniquely identifies each row: ensures that no row is empty, and that every row is different from every other row
- Two-step commit process
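
Two of these features are easy to observe directly. The following sketch, again using SQLite through Python's `sqlite3` module with a hypothetical `patient` table, shows the primary key rejecting a duplicate row and an explicit `ORDER BY` supplying the order that rows do not have on their own. The `commit` and `rollback` calls only show that changes are made permanent explicitly; they are not meant as a demonstration of the two-step commit process named on the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patient (patient_id TEXT PRIMARY KEY NOT NULL, name TEXT)"
)
conn.execute("INSERT INTO patient VALUES ('P02', 'Second')")
conn.execute("INSERT INTO patient VALUES ('P01', 'First')")
conn.commit()

# A second row with an existing key violates the primary key constraint.
try:
    conn.execute("INSERT INTO patient VALUES ('P01', 'Duplicate')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
    conn.rollback()

# Rows have no inherent order; request one explicitly with ORDER BY.
ordered_ids = [pid for (pid,) in conn.execute(
    "SELECT patient_id FROM patient ORDER BY patient_id")]

print(duplicate_rejected, ordered_ids)
```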

The tool for communicating with relational databases: SQL
- Structured Query Language (SQL)
- A query is a question you ask the database, and SQL retrieves the appropriate answer set
- Interactive SQL (command line) vs. RAD tool/GUI
- Standardization issue: ANSI (American National Standards Institute)
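
As an illustration of a query as a question with an answer set, here is a small sketch that sends one SQL statement to an in-memory SQLite database from Python. The `gene` table is a toy example chosen for this sketch, not data from the presentation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gene (symbol TEXT, chromosome TEXT)")
conn.executemany(
    "INSERT INTO gene VALUES (?, ?)",
    [("TP53", "17"), ("BRCA1", "17"), ("EGFR", "7")],
)

# The question: "which genes are on chromosome 17?"
query = "SELECT symbol FROM gene WHERE chromosome = ? ORDER BY symbol"
answer_set = [symbol for (symbol,) in conn.execute(query, ("17",))]

print(answer_set)
```

The same `SELECT` statement could be typed into an interactive SQL prompt or generated by a GUI tool; only the way it is submitted differs.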

Data Types
- Types of data indicate functions that are possible between related fields
- Each field is assigned one data type (imposes structure on data)
- Examples: text (CHAR, VARCHAR), number (INT, DEC); date, time, money, binary
- Standardization issue: ANSI (American National Standards Institute)
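
A short sketch of how a field's type determines which functions apply to it: numbers can be summed, dates can be subtracted. The `sample` table is hypothetical. Note two assumptions: SQLite enforces declared types only loosely (other systems are stricter), and `julianday()` is an SQLite-specific date function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sample (
        sample_id VARCHAR(10),
        volume_ml DECIMAL(6, 2),
        received  DATE,
        processed DATE
    )""")
conn.executemany("INSERT INTO sample VALUES (?, ?, ?, ?)", [
    ("S001", 1.5, "2019-08-01", "2019-08-12"),
    ("S002", 2.0, "2019-08-03", "2019-08-05"),
])

# Numeric fields support arithmetic and aggregation.
(total_volume,) = conn.execute("SELECT SUM(volume_ml) FROM sample").fetchone()

# Date fields support date functions, here the number of days between two dates.
(days,) = conn.execute(
    "SELECT julianday(processed) - julianday(received) "
    "FROM sample WHERE sample_id = 'S001'"
).fetchone()

print(total_volume, days)
```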

A word about database design:
- Designing a database is not trivial
- The value is not in the data, but in the structure
- Design to facilitate the retrieval and interpretation of the data
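
One way to read "design to facilitate retrieval" is to start from a question the database must answer and lay out the tables so that the question becomes a simple join. The sketch below assumes a hypothetical two-table design (`patient` and `sample`, linked by `patient_id`); it is an illustration, not the schema used at Dartmouth.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patient (patient_id TEXT PRIMARY KEY, diagnosis TEXT);
    CREATE TABLE sample (
        sample_id  TEXT PRIMARY KEY,
        patient_id TEXT REFERENCES patient(patient_id),
        tissue     TEXT
    );
    INSERT INTO patient VALUES ('P01', 'melanoma'), ('P02', 'glioma');
    INSERT INTO sample VALUES
        ('S001', 'P01', 'tumor'),
        ('S002', 'P01', 'normal'),
        ('S003', 'P02', 'tumor');
""")

# The question the design was built to answer:
# "which tumor samples do we hold, and for which diagnosis?"
rows = conn.execute("""
    SELECT s.sample_id, p.diagnosis
    FROM sample AS s
    JOIN patient AS p ON p.patient_id = s.patient_id
    WHERE s.tissue = 'tumor'
    ORDER BY s.sample_id
""").fetchall()

print(rows)
```

The diagnosis is stored once per patient rather than once per sample, and the relationship between the tables is what lets the query bring the two together.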

Design database for data extraction: think it through

Example: BioInformatics Core Technology
- Reusable core modules, with customizable components
- Standard business logic framework controls transactions (middle layer)
- Metadata-based back-end data storage (facilitates data sharing)

Data Security: High Priority
HIPAA, FIPS 140-2 (VA), IRB requirements

Life science has become a field which generates an enormous amount of un-integrated data.
How can methods for data organization help to solve this problem?

What is Data Integration?
Creating a system which allows the extraction of a piece or set of information (query result) across multiple domains (possibly disparate data sources: flat files, databases, spreadsheets, URLs...)
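
A minimal sketch of the idea, assuming just two disparate sources: a relational table of gene annotations and a flat file of expression values. All names (`GENE_A`, `highly_expressed_on`, the CSV layout) are hypothetical. The point is that one question is answered by a single function that hides where each piece of data lives.

```python
import csv
import io
import sqlite3

# Source 1: a relational database of gene annotations.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gene (symbol TEXT PRIMARY KEY, chromosome TEXT)")
conn.executemany(
    "INSERT INTO gene VALUES (?, ?)",
    [("GENE_A", "1"), ("GENE_B", "2"), ("GENE_C", "2")],
)

# Source 2: a flat file of expression measurements.
EXPRESSION_CSV = "symbol,fold_change\nGENE_A,5.2\nGENE_B,1.1\nGENE_C,4.8\n"

def highly_expressed_on(chromosome, min_fold):
    """Answer one question that needs both sources."""
    on_chromosome = {s for (s,) in conn.execute(
        "SELECT symbol FROM gene WHERE chromosome = ?", (chromosome,))}
    reader = csv.DictReader(io.StringIO(EXPRESSION_CSV))
    return sorted(
        row["symbol"] for row in reader
        if row["symbol"] in on_chromosome
        and float(row["fold_change"]) >= min_fold
    )

print(highly_expressed_on("2", 4.0))
```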

Understanding transcription factors for protein x production
Show me all genes in the public literature that are putatively related to protein x, have more than 4-fold expression differential between affected and normal tissue and are homologous to known transcription factors.
- Q1: Find homologs (SEQUENCE)
- Q2: Find genes with 4-fold differential (EXPRESSION)
- Q3: Show me genes in public literature (LITERATURE)
(Q1, Q2 and Q3 combined)
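
The combined question can be sketched as three sub-queries whose results are then merged. The sketch assumes the combination is a set intersection, since a gene must satisfy all three conditions; the operator on the original slide did not survive extraction, so this is an interpretation. Gene names and fold values are made up.

```python
# Hypothetical result sets, one per data domain.
q1_homologs = {"GENE_A", "GENE_B", "GENE_D"}  # Q1, SEQUENCE domain

expression_fold = {"GENE_A": 6.0, "GENE_B": 2.5, "GENE_C": 8.1, "GENE_D": 4.5}
q2_differential = {                            # Q2, EXPRESSION domain
    gene for gene, fold in expression_fold.items() if fold > 4
}

q3_literature = {"GENE_A", "GENE_C", "GENE_D"}  # Q3, LITERATURE domain

# Assumption: a candidate gene must appear in all three result sets.
candidates = sorted(q1_homologs & q2_differential & q3_literature)

print(candidates)
```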

Approaches to Integration: where are the key issues addressed?
- Federated database (poses constraints on original data sources; fragility in reliance on source systems)
- Data warehousing (ETL layer, original data sources untouched, required understanding of domain, sophisticated update/archive processes)
- Integrating data source profiles
- Indexed flat files
- Others
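
To show what the ETL (extract, transform, load) layer of a data warehouse does, here is a deliberately small sketch. The two source systems, their field names and the `sample_fact` table are hypothetical. The sources are only read, never changed, and the transform step is where domain knowledge (here, units) is needed.

```python
import sqlite3

# Two hypothetical source systems with different field names and units.
lab_a = [{"id": "S001", "weight_g": 1200.0}]
lab_b = [{"sample": "S002", "weight_kg": 0.5}]

def extract():
    """Read the sources without modifying them."""
    return list(lab_a), list(lab_b)

def transform(a_rows, b_rows):
    """Map both sources onto one schema: (sample_id, weight in grams, source)."""
    unified = [(r["id"], r["weight_g"], "lab_a") for r in a_rows]
    unified += [(r["sample"], r["weight_kg"] * 1000.0, "lab_b") for r in b_rows]
    return unified

def load(conn, rows):
    """Write the unified rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sample_fact "
        "(sample_id TEXT PRIMARY KEY, weight_g REAL, source TEXT)"
    )
    conn.executemany("INSERT INTO sample_fact VALUES (?, ?, ?)", rows)
    conn.commit()

warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(*extract()))
loaded = warehouse.execute(
    "SELECT sample_id, weight_g, source FROM sample_fact ORDER BY sample_id"
).fetchall()

print(loaded)
```

A federated design would instead send the query to both labs at run time, which avoids the copy but makes every query depend on both source systems being available.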
Data Warehousing

Metadata: one key to success
- Data value: 55
- Metadata values:
  - Data element name: vehicle speed
  - Unit: miles per hour
  - Description: the average velocity of a vehicle
- Describes data types, relationships, histories, etc.
- Back-end (supports developers), front-end (supports users and application)
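
The vehicle speed example maps naturally onto a small data structure that keeps a value together with its metadata. The class name `DataElement` and the method `describe` are hypothetical choices for this sketch; the field contents come from the slide.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DataElement:
    """A data value bundled with the metadata needed to interpret it."""
    value: float
    name: str
    unit: str
    description: str

    def describe(self) -> str:
        return f"{self.name} = {self.value} {self.unit} ({self.description})"

speed = DataElement(
    value=55,
    name="vehicle speed",
    unit="miles per hour",
    description="the average velocity of a vehicle",
)

print(speed.describe())
```

Without the metadata, the bare value 55 could be a speed, an age or a count; with it, both developers and end users can interpret the number the same way.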

Standards: the final frontier
- Naming conventions
- Standard coordinate systems
- Unify interpretations of single object types
- Unify software solutions to the same problem (also data formats)
- Standards for metadata (incompatible or missing metadata)

New approach to integration: Cancer Biomarker Discovery
- Network of distributed data silos (does not perturb data sources)
- Centralized query and business logic servers, accessed through web interface
- CORBA framework manages XML profile definitions across the web
- A profile is a set of resource definitions implemented in XML for data sources residing in one or more distributed systems
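
The slides do not show what a profile looks like, so the following is only a guess at the general shape: an XML document listing resource definitions, read here with Python's standard `xml.etree.ElementTree` module. Every element name, attribute name and host name below is invented for illustration and should not be taken as the actual profile format.

```python
import xml.etree.ElementTree as ET

# Hypothetical profile: element and attribute names are invented.
PROFILE_XML = """
<profile id="biomarker-site-1">
  <resource name="specimen_db" type="relational" location="site1.example.org"/>
  <resource name="assay_results" type="flat_file" location="site1.example.org"/>
</profile>
"""

def list_resources(profile_xml):
    """Return (name, type) for each resource definition in a profile."""
    root = ET.fromstring(profile_xml.strip())
    return [(r.get("name"), r.get("type")) for r in root.findall("resource")]

print(list_resources(PROFILE_XML))
```

A central query server could read such profiles to learn which sources exist and how to reach them, without the sources themselves being altered.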