Friday, October 19, 2007

My solutions to Homework 2

Question:
Somewhere out on the Internet is a database of restriction enzymes.
a. Where is it located? What is the URL for the database file that could be used with the GCG software?
Answer: The database of restriction enzymes is REBASE.
The URL for the site is http://rebase.neb.com/rebase/rebase.html
The URL for the database file that could be used with the GCG software is http://rebase.neb.com/rebase/link_gcg

b. What does a typical entry look like for the restriction enzyme file that is formatted for use with the MacVector program?
Answer: REBASE format #19 is used with the MacVector program.
Each entry is composed of lines, and different line types with their own formats represent different data. Each line begins with a two-character line code that indicates the type of information on the line. “//” acts as the delimiter between individual entries.

Each entry in the database contains the following fields:
ID enzyme name
ET enzyme type
OS microorganism name
PT prototype
RS recognition sequence, cut site
MS methylation site (type)
CR commercial sources for the restriction enzyme
CM commercial sources for the methylase
RN [count]
RA authors
RL jour, vol, pages, year, etc.
//
Example of a typical entry:
ID M.BamHII
ET M
OS Bacillus amyloliquefaciens H
PT BamHII
RS GGATCC, ?;
MS 5;
RN [1]
RA Connaughton J.E., Vanek P.G., Chirikjian J.G.;
RL J. Cell Biol. 107:535a-535a(1988).
//
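
The line-code layout described above can be read with a few lines of Python. This is only a minimal sketch, assuming each line is a two-character code followed by whitespace and a value, and that “//” sits alone on its own line; the function name parse_entries is my own and is not part of REBASE or MacVector.

```python
from collections import defaultdict

# The example entry from above, used as sample input.
SAMPLE = """\
ID M.BamHII
ET M
OS Bacillus amyloliquefaciens H
PT BamHII
RS GGATCC, ?;
MS 5;
RN [1]
RA Connaughton J.E., Vanek P.G., Chirikjian J.G.;
RL J. Cell Biol. 107:535a-535a(1988).
//
"""


def parse_entries(text):
    """Yield one dict per entry, mapping line code -> list of values.

    Hypothetical helper: assumes "<code> <value>" lines and a lone "//"
    as the end-of-entry delimiter.
    """
    entry = defaultdict(list)
    for line in text.splitlines():
        line = line.rstrip()
        if not line:
            continue
        if line == "//":  # end of the current entry
            if entry:
                yield dict(entry)
            entry = defaultdict(list)
            continue
        code, _, value = line.partition(" ")
        entry[code].append(value.strip())
    if entry:  # tolerate a missing final "//"
        yield dict(entry)


entries = list(parse_entries(SAMPLE))
print(entries[0]["ID"])  # ['M.BamHII']
```

Values are kept as lists because a code such as RA or RL may appear more than once when an entry has several references.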

c. How is the database (formatted for MacVector) organized?
Answer: The database is organized as a flat file. It is a text-only database with no graphics, and it is in Bairoch format. It contains an alphabetical listing of type I, II and III restriction enzymes as well as methylases, in a format that is compatible with a wide range of data banks (PROSITE, ENZYME, SwissProt, EMBL, ECD, EPD, HAEMB). Each entry is composed of lines, and different line types with their own formats represent different data. Each line begins with a two-character line code that indicates the type of information on the line. “//” acts as the delimiter between individual entries.

1. What is the delimiter between individual restriction enzyme entries? How does the computer (or you) know when the information from one restriction enzyme stops and another one starts?
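Answer: The delimiter between individual restriction enzyme entries is “//”. A line containing only “//” marks the end of one entry, and the next entry begins with its ID line.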

2. Is this format similar to the format used by any other database? Which one?
Answer: I compared the MacVector format with the DNA Strider format in REBASE. The two are similar in that both provide information about restriction enzymes and both are plain-text flat files, but they differ in the number of fields and in how fields and entries are separated.

| MacVector | DNA Strider |
| --- | --- |
| Each entry has many more descriptive fields than DNA Strider – enzyme name, enzyme type, organism name, prototype, recognition sequence, cut site, methylation site and commercial sources. | Each entry only has two descriptive fields – enzyme name, recognition sequence with cleavage site. Individual fields are separated by a comma (,) |
| Individual entry is separated by “//” | Individual entries start on a new line |
| Flatfile format | Flatfile format |

Then I compared the MacVector format in REBASE with the GenBank database. Although both use “//” as the delimiter between entries, most of the other attributes are very different.
| MacVector | GenBank |
| --- | --- |
| Each entry starts with the ID field. | Each entry starts with the LOCUS field. |
| Each field is represented by a two-character line code such as ID, OS etc. | Each field is represented by one or more descriptive words such as DEFINITION, LOCUS etc. |
| Information is only available in FASTA format. | Information is available in a wide range of formats such as FASTA, XML, Graphics etc. |
| Individual entries are separated by “//” | Individual entries are separated by “//” |
| It contains information about the restriction site of the enzyme and does not contain any information about the amino acid or nucleotide sequence. | It contains information about the nucleotide sequence. If the sequence codes for an expressed protein, it also contains the translation. |
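
Because both files end each entry with “//”, the same splitting routine can serve for either one; only the leading keyword differs (ID versus LOCUS). The sketch below illustrates that point under stated assumptions: the helper names are hypothetical, and the GenBank-style text is a made-up placeholder, not a real record.

```python
def split_records(text, delimiter="//"):
    """Split flat-file text into records (lists of lines) on a lone delimiter line."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == delimiter:
            if current:
                records.append(current)
            current = []
        elif line.strip():
            current.append(line)
    if current:  # keep a trailing record that lacks a final delimiter
        records.append(current)
    return records


def first_keyword(record):
    """Return the first whitespace-separated token of a record's first line."""
    return record[0].split(None, 1)[0]


# Shortened REBASE-style entry taken from the example earlier in this post.
REBASE_STYLE = "ID M.BamHII\nET M\nPT BamHII\n//\n"

# Placeholder GenBank-style record, invented purely for illustration.
GENBANK_STYLE = (
    "LOCUS       PLACEHOLDER01\n"
    "DEFINITION  placeholder record for illustration only.\n"
    "//\n"
)

for label, text in [("REBASE", REBASE_STYLE), ("GenBank", GENBANK_STYLE)]:
    record = split_records(text)[0]
    print(label, "entries start with", first_keyword(record))
```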

Literature Search Questions
1) Select a protein and find the entries for this protein in the GenBank DNA database, the SwissProt database, and the PDB Protein database. List the attributes or features that are common to the databases and those which are unique to each.
Answer: I searched the three databases for the β subunit of human follicle-stimulating hormone.
PDB results:
URL: http://www.pdb.org/pdb/explore.do?structureId=1FL7
SwissProt results:
URL: http://au.expasy.org/uniprot/P01225
GenBank results:
URL: http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nuccore&id=EF198021

| GenBank | SwissProt | PDB |
| --- | --- | --- |
| The identifying code is the locus or the accession number, which is usually an eight-character alphanumeric string. | The identifying code is called the primary accession number. It is usually a six-character alphanumeric string. | The unique identifying code is short, usually a four-character alphanumeric string. |
| GenBank contains the following information in each entry: locus, definition, accession #, version, keywords, source, organism, references, gene, mRNA and CDS. | SwissProt contains the following information in each entry: entry name, primary accession number, information about the name and origin of the protein, references, links to cross-references and the amino acid sequence. | PDB contains the following information in each entry: title, references, history, experimental information, molecular description and information about the structure of the protein. |
| GenBank gives the DNA sequence that encodes the protein. It also contains the translated sequence, if it is expressed. | SwissProt gives the amino acid sequence of the protein. Other sequence information can be found by following the links. | PDB gives detailed structural (3D) information about the protein, with images and figures to help visualize the molecule. It also contains the amino acid sequence. |
| GenBank does not provide links for cross-referencing. | SwissProt has many links which allow for easy cross-referencing. | PDB also allows cross-referencing via “external links”. |
| GenBank is a much larger database than SwissProt or PDB. | SwissProt is not as big a database as GenBank, but is bigger than PDB. | PDB is a relatively small database; many proteins that were available at SwissProt and GenBank could not be found here. |
| Information can be displayed in a wide array of formats such as FASTA, GenBank, XML, Graphical etc. | Sequence information is available only in two formats: SwissProt and FASTA. There is no graphical representation of data. | Sequence information is available in FASTA format. However, there is also ample graphical representation of the data. |
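
The difference in identifier length noted in the first row can be used to guess which database an accession belongs to. The following is a rough sketch rather than a validator: the patterns are my own simplifications (the GenBank pattern covers only the two-letter, six-digit form seen in EF198021, and the SwissProt pattern follows the accession scheme UniProt documents, as far as I recall), and classify_accession is a hypothetical name.

```python
import re

# Simplified, assumed patterns; real accession rules have more variants.
PATTERNS = [
    # PDB: four characters, starting with a digit (e.g. 1FL7).
    ("PDB", re.compile(r"[0-9][A-Za-z0-9]{3}")),
    # SwissProt/UniProt: six (or ten) characters (e.g. P01225).
    ("SwissProt", re.compile(
        r"[OPQ][0-9][A-Z0-9]{3}[0-9]"
        r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
    )),
    # GenBank nucleotide: two letters plus six digits (e.g. EF198021).
    ("GenBank", re.compile(r"[A-Z]{2}[0-9]{6}")),
]


def classify_accession(accession):
    """Return the name of the first database whose pattern fully matches."""
    for name, pattern in PATTERNS:
        if pattern.fullmatch(accession):
            return name
    return "unknown"


# The three identifiers used in the URLs above.
for acc in ["EF198021", "P01225", "1FL7"]:
    print(acc, "->", classify_accession(acc))
```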



2) How many secreted proteins have been discovered in humans? Explain what database you used, and what keywords you used to do the search.
Answer: I performed this search in SRS. I initially searched multiple databases, but the results were redundant, so I repeated the search using only one dataset, the patent proteins dataset. I chose it because results would not be duplicated there and because I expected most secreted proteins to be entered there.
URL: http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz
Database searched: Patent Proteins
Search Field 1: All text - Secreted
Search Field 2: Organism name - human
Result: 10,319
