NCBI News

March 1995

BankIt: GenBank Offers New Submission Tool on WWW
GenBank Adding 1,000 Human Sequences per Day from Merck Project
Entrez on the Net: Use of Internet Versions Encouraged
MMDB: A Molecular Modeling Database
GenBank Fellows Begin Research
Frequently Asked Questions
Selected Recent Publications by NCBI Staff
GenBank: Easy Deposits, Unlimited Withdrawals, High Interest
Masthead

BankIt: GenBank Offers New Submission Tool on WWW

GenBank users may now use the World Wide Web (WWW) to submit sequences to GenBank. The NCBI is pleased to announce BankIt, a new Web-based data submission tool that provides a simple forms-based method for submitting sequences to GenBank. BankIt was developed by GenBank in conjunction with its international collaborating databases (EMBL and DDBJ), and it is anticipated that EMBL and DDBJ will offer similar services in the near future. With BankIt, your data will be submitted directly to GenBank, then forwarded for inclusion in the EMBL and DDBJ databases.

With BankIt, you enter sequence information into a form, edit as necessary, and add biological annotation (e.g., coding regions, mRNA features). Free-form text boxes provide the option of using your own words to describe the sequence, without having to learn formatting rules or use restricted vocabularies. BankIt creates a draft record in GenBank format for you to review and revise as you wish. When the record is completed to your satisfaction, you may submit it to GenBank with the click of a button. The GenBank annotation staff will review your draft submission and issue your accession number within 24 hours.

Quality assurance is a major focus at GenBank. After issuing your accession number, GenBank annotation staff will continue processing your entry to ensure that the biological descriptions are accurate and complete and to perform the quality checks that are routinely applied to GenBank submissions. This includes taxonomy look-ups, links to existing MEDLINE records, checks for coding sequences, tests for vector contamination, and an overall review of biological content. Each entry is checked by at least three biologists. When the quality control is completed, GenBank will send you the entry for your review and final approval.

No special software is needed to use BankIt other than a WWW browser. BankIt may be used with the following platforms and browsers:

Unix: Netscape and Mosaic
Mac: Netscape and MacWeb
PC/Windows: Netscape

The current release of Mosaic for the PC and Mac does not work with BankIt. To access BankIt, use your Web browser to connect to the NCBI Home Page. The URL is http://www.ncbi.nlm.nih.gov/. In addition to BankIt, the NCBI Home Page also serves as a starting point for searching GenBank on the Web and obtaining information about NCBI databases and services. If you have any questions on using BankIt, please contact the GenBank support staff at info@ncbi.nlm.nih.gov or at (301) 496-2475.


GenBank Adding 1,000 Human Sequences per Day from Merck Project

As part of a major sequencing project funded by Merck and Company, the Genome Sequencing Center at Washington University in St. Louis, directed by Robert Waterston, is adding approximately 1,000 human cDNA sequences ("Expressed Sequence Tags" or ESTs) per day to GenBank. J. Craig Venter, formerly of NIH but now President and Director of The Institute for Genomic Research, pioneered the expressed gene survey concept 4 years ago (Science 252:1651, 1991). In October 1994, Merck and Company announced the goal of developing a complete set of human mRNA sequences to be accessible as a public resource (see Nature 372:10, 1994). Based on its reputation for megabase sequencing of the C. elegans nematode genome, Waterston's center was selected by Merck to do the sequencing of approximately 200,000 clones from a number of different tissue-specific cDNA libraries. These clones are being placed on high-density arrays by Greg Lennon at Lawrence Livermore National Laboratory, and the physical DNA clones (as well as the derived sequences) will be available without restriction. Each clone will be sequenced from both the 5' and 3' ends, resulting in about 300-450 nucleotides of "single pass" sequence from each end. The goal is to obtain up to 400,000 sequences over 18 months.
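The project figures quoted above can be checked with a little arithmetic. The following Python sketch is only an illustration: the function names are hypothetical, and the average month length of 30.4 days is an assumption that does not come from the article.

```python
# Figures quoted in the article above.
CLONES = 200_000            # approximate number of cDNA clones
READS_PER_CLONE = 2         # one read from the 5' end, one from the 3' end
CURRENT_RATE_PER_DAY = 1_000
PROJECT_MONTHS = 18

# Assumption, not from the article: an average month of 30.4 days.
DAYS_PER_MONTH = 30.4


def total_reads(clones: int, reads_per_clone: int) -> int:
    """Number of single-pass sequences if every clone is read from each end."""
    return clones * reads_per_clone


def days_needed(total: int, rate_per_day: float) -> float:
    """Days required to submit `total` sequences at a constant daily rate."""
    return total / rate_per_day


def average_rate_needed(total: int, months: float) -> float:
    """Constant daily rate that would spread `total` sequences over `months`."""
    return total / (months * DAYS_PER_MONTH)


if __name__ == "__main__":
    total = total_reads(CLONES, READS_PER_CLONE)
    print(f"Total reads: {total:,}")
    print(f"Days at current rate: {days_needed(total, CURRENT_RATE_PER_DAY):.0f}")
    print(f"Average rate for {PROJECT_MONTHS} months: "
          f"{average_rate_needed(total, PROJECT_MONTHS):.0f} per day")
```

Under these assumptions, two reads from each of 200,000 clones give the 400,000-sequence goal, and a steady 1,000 sequences per day would reach it in about 400 days, well inside the 18-month window.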

The sequence data is submitted daily to dbEST (the EST division of GenBank; see Science 265:1993, 1994) and distributed to the public by all the existing search and retrieval mechanisms. In addition, there are plans to develop and make available "assemblies" or "contigs" of sequences that appear to derive from the same mRNA. The NCBI is also working with high-throughput mapping groups to develop a transcript or expression map of the genome based upon ESTs and other gene-based markers derived from existing human GenBank sequences.

The first 15,000 sequences from the project, known as the Merck Gene Index, were released to the public on February 10. As of March 9, dbEST contained 90,506 human sequences, contributed by 43 different laboratories. Three groups have contributed more than 10,000 sequences. These are the Merck Gene Index with 34,301, Genethon with 21,245, and The Institute for Genomic Research with 16,049.
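For readers who want the March 9 figures as proportions, here is a minimal Python sketch. The counts are taken from the paragraph above; the variable and function names are illustrative only.

```python
# Counts reported in the article as of March 9.
DBEST_HUMAN_TOTAL = 90_506
TOP_CONTRIBUTORS = {
    "Merck Gene Index": 34_301,
    "Genethon": 21_245,
    "The Institute for Genomic Research": 16_049,
}


def percent_share(count: int, total: int) -> float:
    """Return `count` as a percentage of `total`."""
    return 100.0 * count / total


def combined_count(contributions: dict) -> int:
    """Sum the sequence counts of all listed contributors."""
    return sum(contributions.values())


if __name__ == "__main__":
    for name, count in TOP_CONTRIBUTORS.items():
        print(f"{name}: {percent_share(count, DBEST_HUMAN_TOTAL):.1f}%")
    top = combined_count(TOP_CONTRIBUTORS)
    print(f"Three largest groups combined: {top:,} "
          f"({percent_share(top, DBEST_HUMAN_TOTAL):.1f}%)")
    print(f"Remaining laboratories: {DBEST_HUMAN_TOTAL - top:,}")
```

The three largest contributors together account for 71,595 of the 90,506 human sequences, or roughly 79 percent.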

Through a collaborative project between the NCBI and the European Bioinformatics Institute (EBI) near Cambridge, England, dbEST is being mirrored at EBI to provide more convenient access for European users. dbEST is available through the EBI WWW server (http://www.ebi.ac.uk/), NCBI's WWW server (http://www.ncbi.nlm.nih.gov/), and NCBI's RETRIEVE e-mail server. dbEST sequences are also available as a "line item" database for searches using NCBI's BLAST servers. The data and updates can also be obtained by anonymous FTP at ncbi.nlm.nih.gov.


Entrez on the Net: Use of Internet Versions Encouraged

The CD-ROM version of Entrez expanded from two discs to three in October 1994. The rapid influx of data resulting from the Merck Gene Index (see article on page 1) will require a fourth Entrez disc in June of this year, two releases earlier than originally expected. As the number of CD-ROMs continues to increase, subscribers might find alternative means of access to Entrez more convenient and economical.

The NCBI now provides Entrez in three versions. The original version is the CD-ROM. The other two versions both rely on the Internet: Network Entrez has a look and feel identical to the CD-ROM version; Web Entrez has the same capabilities as the Network version, but the user interface has been adapted for use with the Internet information resource known as the World Wide Web (WWW).

Entrez on the Internet
Both Network Entrez and Web Entrez are examples of what are called client-server applications. The "client" part is the software that runs on your own Macintosh, PC, or other workstation to manage the graphical display and to accept user input. In the case of Network Entrez, the client program is Entrez itself. For Web Entrez, the client program is a WWW browser, such as Mosaic, Netscape, MacWeb, Cello, or Lynx. When searches are performed, the client program on your local computer contacts a database "server" system at the NCBI. Records from the Entrez databases of DNA and protein sequences and related MEDLINE citations are transmitted from the server via the Internet to your local computer for display and further processing. Depending on the speed of your Internet connection, the responsiveness of Network and Web Entrez may be as good as or better than the CD-ROM version.

Internet access to Entrez has several advantages over the CD-ROM version. First, access to both Network and Web Entrez is free; there are no subscription charges. Second, CD-ROM drives are not required and only a small amount of local hard disk storage is necessary. Third, because the storage limitations of the CD-ROM discs are removed, we can offer access to a broader sequence-related subset of MEDLINE, approximately five times the size of the CD-ROM subset. Soon there will be two additional advantages to the Internet versions: daily database updates, rather than at 2-month CD-ROM intervals, and the ability to view three-dimensional structures (see MMDB on page 4).

To run Network Entrez or Web Entrez, you will need a direct TCP/IP connection to the Internet (an "e-mail only" connection is not sufficient). If your institution does not have a direct Internet connection, there are likely to be commercial Internet access providers in your area who can provide low-cost dial-up connections.

If you have the appropriate network connection, the Network and Web versions of Entrez have slightly different start-up requirements. To use Web Entrez, all you need is a Web browser (ask your network or system administrator) and the NCBI URL. A URL (Uniform Resource Locator) is the Web equivalent of an e-mail address. The URL for the NCBI's World Wide Web site is http://www.ncbi.nlm.nih.gov/. When your browsing program connects to the NCBI Web site, also called the "Home Page," select the "Searching GenBank and Other Databases" option. Then, from the "Searching GenBank" page, select "Entrez Browser."
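To make the idea of a URL more concrete, the short Python sketch below splits the NCBI address into its parts using the standard library's urllib.parse module. The helper name describe_url is hypothetical and used only for illustration.

```python
from urllib.parse import urlparse

# The NCBI Home Page address given in the article.
NCBI_HOME_PAGE = "http://www.ncbi.nlm.nih.gov/"


def describe_url(url: str) -> dict:
    """Split a URL into the access method, the host name, and the path."""
    parts = urlparse(url)
    return {
        "scheme": parts.scheme,   # how to connect (here, the Web protocol)
        "host": parts.hostname,   # which computer to contact
        "path": parts.path,       # which document on that computer
    }


if __name__ == "__main__":
    for key, value in describe_url(NCBI_HOME_PAGE).items():
        print(f"{key}: {value}")
```

Much as an e-mail address names a user and a host, a URL names an access method, a host, and a document on that host.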

To get started with Network Entrez, your network or system administrator should send e-mail to net-info@ncbi.nlm.nih.gov requesting free registration. Your administrator will receive instructions on installing Network Entrez by return e-mail.

SLIPping Into Entrez
If high-speed direct Internet connections are not currently possible for you, access to a local Internet provider by modem is becoming increasingly easy and inexpensive to obtain. Modem access requires at least a 9,600 baud modem (and preferably 14,400 baud or faster) and special software that provides the communications links used by the Internet. Specifically, the communications software will create either a SLIP (Serial Line Internet Protocol) or a PPP (Point-to-Point Protocol) connection between your local desktop Mac, PC, or other workstation and a computer or router on the Internet. If you use a commercial Internet provider, it will probably provide the necessary software and can assist you with software installation and configuration. Charges will vary depending on your location and the type of service you obtain. In the Washington, DC, area, monthly charges for unlimited single-user SLIP/PPP access are typically in the $20-$40 range. Less expensive monthly rates may be accompanied by a connect-time charge.

Lists of Internet providers can be obtained from a number of sources on the Internet, such as through anonymous FTP from nis.nsf.net in the internet/providers directory or by connecting to the Web URL at ftp://nis.nsf.net/internet/providers/. Information about commercial providers is also included in the many recent books about using the Internet. NCBI can furnish a list of providers upon request.


MMDB: A Molecular Modeling Database

A team of NCBI scientists led by Steve Bryant has developed the Molecular Modeling Database (MMDB), a database that organizes and presents macromolecular 3-D structure information in a way convenient for conducting molecular modeling research. By integrating chemical, sequence, and structure information, MMDB is designed to serve as a resource for structure-based homology modeling and protein structure prediction.

Building the Database
MMDB's source of data is the Brookhaven Protein Data Bank (PDB). MMDB reorganizes and validates the information in a way that enables cross-referencing between the chemistry and the three-dimensional structure of macromolecules. While the PDB data model provides an elegant and concise description of a crystal structure, there is no one-to-one correspondence between a site, a structure, and an atom in the chemical sense. MMDB provides this chemical information in an explicit manner. Its data specification includes a description of a biopolymer's spatial structure, a description of how it is organized chemically, and a set of pointers linking the two.
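The three-part layout described above (spatial structure, chemical organization, and pointers linking the two) can be pictured with a small Python sketch. This is not the MMDB ASN.1 specification: every class, field, and coordinate value below is a hypothetical stand-in chosen only to illustrate the idea.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Coordinate = Tuple[float, float, float]


@dataclass(frozen=True)
class Atom:
    """An atom in the chemical sense (hypothetical fields)."""
    atom_id: int
    residue_id: int
    name: str


@dataclass(frozen=True)
class Bond:
    """A chemical bond between two atoms, identified by atom_id."""
    atom_a: int
    atom_b: int


@dataclass
class Structure:
    """Chemical graph, coordinate sites, and pointers linking the two."""
    atoms: Dict[int, Atom]
    bonds: List[Bond]
    sites: List[Coordinate]
    atom_to_site: Dict[int, int]  # pointer: atom_id -> index into `sites`

    def position(self, atom_id: int) -> Coordinate:
        """Follow the pointer from a chemical atom to its coordinates."""
        return self.sites[self.atom_to_site[atom_id]]

    def bonded_to(self, atom_id: int) -> List[int]:
        """Atoms bonded to `atom_id` according to the chemical graph."""
        partners = []
        for bond in self.bonds:
            if bond.atom_a == atom_id:
                partners.append(bond.atom_b)
            elif bond.atom_b == atom_id:
                partners.append(bond.atom_a)
        return partners

    def unlocated_atoms(self) -> List[int]:
        """Atoms present in the chemical graph but lacking coordinates."""
        return sorted(a for a in self.atoms if a not in self.atom_to_site)


def example_structure() -> Structure:
    """A three-atom backbone fragment; coordinates are placeholder values."""
    atoms = {
        1: Atom(1, 1, "N"),
        2: Atom(2, 1, "CA"),
        3: Atom(3, 1, "C"),
    }
    bonds = [Bond(1, 2), Bond(2, 3)]
    sites = [(0.0, 0.0, 0.0), (1.46, 0.0, 0.0)]
    return Structure(atoms, bonds, sites, atom_to_site={1: 0, 2: 1})
```

In this toy layout the chemistry is complete even where coordinates are missing: the third atom is still part of the bond graph, but has no pointer into the coordinate list.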

The first step in creating MMDB is getting an accurate sequence that is consistent with the atom site coordinates in PDB. The next step is to construct a complete chemical graph for the molecule, representing all bonds and chirality. An important component of this second step matches the amino acid and nucleotide groups defined by PDB against a dictionary that defines all bond and atom types. The third and final step is to recover disorder information in the structure.

Using MMDB To Enhance Entrez
The result of linking chemistry and 3-D structure is that, using public domain graphics packages such as Kinemage and RasMol, Network Entrez can be used to display a 3-D image of protein data. NCBI expects to introduce this feature by mid-1995. Network Entrez users who have the appropriate graphics software will then be able to view the 3-D structure of proteins identified via text or sequence similarity searching. Somewhat later, after further research and testing, NCBI anticipates adding a "structural neighbors" feature to Network Entrez, which will identify proteins that appear to be related by a combination of strong sequence and structural similarity. In this way, we expect that MMDB will serve as a useful resource for homology modeling and protein structure prediction.

Tool for Software Developers
MMDB organizes macromolecular 3-D structure information in a form convenient for developers of molecular modeling software. MMDB data are specified in ASN.1 and provided as data structures addressable in the C language via I/O routines provided by NCBI. A beta version of MMDB is now available as ASN.1 tag-value files, with associated object loaders to facilitate programming access from the C language. Sample application programs making use of MMDB and the object loaders are also available. The current specification, data files, and software for loading the structures are available by anonymous FTP from ncbi.nlm.nih.gov in the pub/mmdb directory.


GenBank Fellows Begin Research

Five GenBank Fellows have been selected and are pursuing various applied research projects to improve the quality of GenBank entries, reduce sequence redundancy, refine the GenBank taxonomy, and establish links to databases containing mapping data and three-dimensional macromolecular structures.

The GenBank Fellowship Program, an NCBI initiative to improve the quality of the database and provide training in bioinformatics, selects scientists with strong backgrounds in biology and an interest in applying computational tools to research problems in molecular and structural biology, genetics, and phylogeny. In addition to these general qualifications, the Fellows bring diverse backgrounds and research interests to the NCBI.

Christopher Hogue received a bachelor's degree in biochemistry from the University of Windsor and a Ph.D. in biochemistry from the University of Ottawa. For his thesis research, performed in the fluorescence spectroscopy lab of Professor Arthur G. Szabo at the National Research Council of Canada, he developed methods for the biosynthetic incorporation of analog amino acids into proteins as intrinsic fluorescent probes. As a GenBank Fellow, Hogue is working with Steve Bryant helping to develop software tools for "threading" protein structure prediction techniques, which he is using to examine the structure/function relationships within specific protein families.

Detlef Leipe graduated from the Free University of Berlin with a master's degree in zoology and a Ph.D. in biology. His doctoral research, conducted in the laboratory of Professor K. Hausmann, was an ultrastructural study on the nuclei of ciliated protozoa. As a postdoc with Mitchell Sogin at the Marine Biological Laboratory in Woods Hole, MA, Leipe used molecular methods to study the evolution of eukaryotic microorganisms. He is currently conducting research on the molecular evolution of unicellular eukaryotes and working on the GenBank taxonomy project.

Wojciech Makalowski received master's degrees in molecular biology and philosophy of science and a Ph.D. in molecular biology from the Adam Mickiewicz University at Poznan in Poland. He joined the NCBI following a postdoc with Damian Labuda at Montreal University. Makalowski is interested in using biological sequence information in research on genomic evolution. He is currently studying dispersed repetitive elements in GenBank entries and conducting a comparative analysis of mouse and human gene sequences.

Heidi Sofia graduated from the University of California at Berkeley with a bachelor's degree in biochemistry and a master's in biochemical toxicology. She earned a Ph.D. in biochemistry from the University of Wisconsin at Madison, then did a postdoc with Fred Blattner in the E. coli Genome Project before moving to the NCBI as a GenBank Fellow. Her current research interests are high throughput analysis of EST sequences and vector contamination in GenBank.

Jane Weisemann received a bachelor's degree in biochemistry from the University of Maryland and a master's degree in biomedical sciences from Hood College. She earned a Ph.D. in molecular biology from the University of Texas Health Science Center Houston. She worked as a postdoc in the Plant Molecular Biology Laboratory at the USDA Beltsville Agricultural Research Center and then as a GenBank sequence annotator at the National Library of Medicine before joining the NCBI. Weisemann is studying the comprehensiveness and consistency of mapping information in GenBank entries and exploring ways to link information of this sort with other databases.

Photo caption: NCBI GenBank Fellows are, standing from left, Heidi Sofia, Christopher Hogue, and Detlef Leipe. In the front row are Wojciech Makalowski and Jane Weisemann.


Frequently Asked Questions

I would like to obtain a recent GenBank record that I believe was added since the last full release. Do I send a message to the update e-mail address?

No. The update e-mail address (update@ncbi.nlm.nih.gov) is used for making changes to existing GenBank entries or for requesting that a published sequence be released to the public. To obtain a recently released GenBank record, use the RETRIEVE e-mail server (retrieve@ncbi.nlm.nih.gov). The full GenBank database, including the most recent additions, will be searched automatically.

When I try to retrieve database records with accession numbers that begin with the letter A, P, S, or Q, I often get zero documents using the RETRIEVE e-mail server. These accession numbers were part of my BLAST search results, so I know they must be there somewhere.

Entries with such accession numbers can come from databases other than GenBank. In the BLAST report of high scoring segment pairs, sequences are identified with a database code, accession number, and locus name. In the example "sp | P01013 | OVAX_CHICK", the database code "sp" signifies that this entry is in the Swiss-Prot database. Database codes are discussed in section 8 of the BLAST HELP documentation. When doing a RETRIEVE search to obtain the full records, specify the correct database with the DATALIB command, e.g., DATALIB sp instead of DATALIB genbank.
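The identifier format in the example above is easy to take apart by program. The Python sketch below is a minimal illustration; the function names are hypothetical, and only the "sp" pairing with DATALIB sp comes from this answer. Codes for other databases should be looked up in the BLAST HELP documentation rather than guessed.

```python
from typing import NamedTuple, Optional


class SequenceId(NamedTuple):
    database: str   # database code, e.g. "sp"
    accession: str  # accession number, e.g. "P01013"
    locus: str      # locus name, e.g. "OVAX_CHICK"


# Only this pairing is stated in the article; others are deliberately omitted.
DATALIB_FOR_CODE = {"sp": "sp"}


def parse_sequence_id(identifier: str) -> SequenceId:
    """Split 'code | accession | locus' into its three fields."""
    fields = [field.strip() for field in identifier.split("|")]
    if len(fields) != 3 or not all(fields):
        raise ValueError(f"expected three '|'-separated fields: {identifier!r}")
    return SequenceId(*fields)


def datalib_command(seq_id: SequenceId) -> Optional[str]:
    """Return the DATALIB line for a known database code, else None."""
    library = DATALIB_FOR_CODE.get(seq_id.database)
    return None if library is None else f"DATALIB {library}"


if __name__ == "__main__":
    seq_id = parse_sequence_id("sp | P01013 | OVAX_CHICK")
    print(seq_id)
    print(datalib_command(seq_id))
```

Returning None for an unrecognized code, rather than guessing a library name, keeps a mistaken search from being sent to the wrong database.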

I have an old copy of the Authorin manual, and I'm not sure if the addresses and phone numbers listed for GenBank are correct.

The correct address for submitting entries to GenBank by e-mail is gb-sub@ncbi.nlm.nih.gov. The correct address to submit GenBank entries by postal mail is GenBank Submissions, National Center for Biotechnology Information, 8600 Rockville Pike, Bethesda, MD 20894. For technical assistance, the e-mail address is authorin@ncbi.nlm.nih.gov and the phone number is (301) 496-2475.

I tried using BankIt, but some of the buttons and forms don't appear where the documentation says they should. How can I use this system?

You are most likely using a World Wide Web browser that is not compatible with the BankIt form. Note that on Mac and PC computers, the current release of Mosaic will not work with BankIt. Use an alternative such as Netscape or MacWeb. (See the BankIt article beginning on page 1.)

I sent a search to the RETRIEVE server to get the GenBank entry for ssp5. I got back more than 30,000 records and they are not ssp5. There should only be one or two.

The most likely reason is that your e-mail program automatically appends a signature block to your message. If there is no blank line between the search and your signature block, then all the words, numbers, and initials in your signature are included as part of the search. To avoid this, just hit the carriage return a couple of times at the end of your search to insert blank lines.
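If you generate search messages with a script, the same precaution can be built in. The Python sketch below is a hedged illustration: compose_search_message is a hypothetical helper, the signature text is invented, and the sketch says nothing about the RETRIEVE command syntax itself. It only guarantees blank lines between the search text and any signature.

```python
def compose_search_message(search: str, signature: str = "",
                           blank_lines: int = 2) -> str:
    """Return a message body with blank lines separating search and signature.

    `blank_lines` is the number of empty lines placed after the search text.
    """
    if blank_lines < 1:
        raise ValueError("at least one blank line is needed after the search")
    body = search.rstrip()
    # One newline ends the last search line; the rest are the blank lines.
    separator = "\n" * (blank_lines + 1)
    signature = signature.strip("\n")
    if signature:
        return body + separator + signature + "\n"
    return body + separator


if __name__ == "__main__":
    message = compose_search_message(
        "ssp5",
        signature="Jane Doe\nExample Laboratory",  # invented signature block
    )
    print(message)
```

Stripping trailing whitespace from the search text first means the blank lines are always counted from the last real line of the search.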


Selected Recent Publications by NCBI Staff

Ahmad, N, BM Baroudy, RC Baker, and C Chappey. Genetic analysis of immunodeficiency virus type 1 envelope V3 region from mother and infant isolates following perinatal transmission. J Virol 69:1001-12, 1995.

Boguski, MS, CM Tolstoshev, and D Bassett. Gene discovery in dbEST. Science 265:1993-4, 1994.

Borodovsky, M, EV Koonin, and KE Rudd. New genes in old sequence: a strategy for finding genes in the bacterial genome. Trends Biochem Sci 19:309-13, 1994.

Clark, MW, T Keng, RK Storms, W Zhong, N Fortin, S Delaney, BFF Ouellette, AB Barton, DB Kaback, and H Bussey. Sequencing of chromosome I of Saccharomyces cerevisiae: analysis of the 42 kbp SPO7-CENI-CDC15 region. Yeast 10:535-41, 1994.

Claverie, JM, and W Makalowski. Alu alert. Nature 371:752, 1994.

Hu, HM, K O'Rourke, MS Boguski, and VM Dixit. A novel RING finger protein interacts with the cytoplasmic domain of CD40. J Biol Chem 269:30069-72, 1994.

Rudd, KE, PE Rouviere, S Lazar, H Sofia, G Plunkett, and EV Koonin. A new family of peptidyl-prolyl isomerases. Trends Biochem Sci 20:12-4, 1995.

Tatusov, RL, SF Altschul, and EV Koonin. Detection of conserved segments in proteins: iterative scanning of sequence databases with alignment blocks. Proc Natl Acad Sci USA 91:12091-5, 1994.

Tatusov, RL, and EV Koonin. A simple tool to search for sequence motifs in BLAST outputs. Comput Appl Biosci 10:457-9, 1994.

Wootton, JC. Non-globular domains in protein sequences: automated segmentation using complexity measures. Comput Chem 18:269-85, 1994.


GenBank: Easy Deposits, Unlimited Withdrawals, High Interest

It's easy and free to contribute sequences to GenBank and search the database. This list summarizes the data submission and search services available from NCBI.

Information
Purpose: Obtain general information about NCBI databases and services
How To Use/How To Get Help: Send e-mail to info@ncbi.nlm.nih.gov or call the Service Desk at (301) 496-2475.

GenBank Submissions
Purpose: Submit new sequences to GenBank.
How To Use/How To Get Help: For information or technical assistance: info@ncbi.nlm.nih.gov

Service: Authorin Software
Purpose: Prepare new or updated GenBank entry.
How To Use/How To Get Help: Send a new submission by e-mail: gb-sub@ncbi.nlm.nih.gov. To obtain software for Mac or PC, send request to: authorin@ncbi.nlm.nih.gov

Service: BankIt on WWW
Purpose: Prepare and submit new GenBank entry over the Internet, using the World Wide Web.
How To Use/How To Get Help: For information on compatible WWW browsers: info@ncbi.nlm.nih.gov
To access BankIt through NCBI Home Page: http://www.ncbi.nlm.nih.gov/

GenBank Updates
Purpose: Correct or update an existing sequence; request release of published data.
How To Use/How To Get Help: Send an update request by e-mail: update@ncbi.nlm.nih.gov

E-mail servers
Service: retrieve@ncbi.nlm.nih.gov
Purpose: Retrieve GenBank and other sequence database records from an e-mail server based on any text term, including accession number, author name, locus, gene name, etc.
How To Use/How To Get Help: To receive documentation, send a message containing only the word HELP. For personal assistance, send e-mail to: retrieve-help@ncbi.nlm.nih.gov

Service: blast@ncbi.nlm.nih.gov
Purpose: Perform a sequence similarity search of GenBank and other sequence databases using the BLAST algorithm.
How To Use/How To Get Help: To receive documentation, send a message containing only the word HELP. For personal assistance, send e-mail to: blast-help@ncbi.nlm.nih.gov

Internet applications
Purpose: "Client-server" programs, in which a client program on a local PC, Mac, or Unix workstation queries an NCBI server via the network.
How To Use/How To Get Help: All NCBI network applications require Internet access and locally installed TCP/IP software.

Service: Network Entrez
Purpose: Point-and-click retrieval system for workstations. Provides text-based searching of sequence databases and a sequence-related subset of MEDLINE.
How To Use/How To Get Help: To register and obtain client software for PCs running Windows, Macs, and Unix, send e-mail to: net-info@ncbi.nlm.nih.gov

Service: Network BLAST
Purpose: Interactive BLAST similarity searching for PC (DOS), Mac, Unix, and VMS workstations.
How To Use/How To Get Help: To register and obtain client software, send e-mail to: blast-help@ncbi.nlm.nih.gov

Service: World Wide Web access
Purpose: WWW access to NCBI databases and search services, including BankIt for GenBank submissions and Web versions of RETRIEVE, BLAST, and Entrez.
How To Use/How To Get Help: For information on compatible WWW browsers: info@ncbi.nlm.nih.gov
To access NCBI Home Page: http://www.ncbi.nlm.nih.gov/

Anonymous FTP: ncbi.nlm.nih.gov
Purpose: Obtain GenBank releases, NCBI software, and various molecular biology databases.
How To Use/How To Get Help: Login as "anonymous" (unquoted) and enter your e-mail address as your password.

CD-ROMs
Purpose: For users who do not have Internet access or who prefer a local copy of databases.
How To Use/How To Get Help: For information about subscriptions, send e-mail to: info@ncbi.nlm.nih.gov

Service: Entrez (GPO list ID: ENT)
Purpose: CD-ROM version of Network Entrez. Annual e-mail subscription (6 issues, 3 discs/issue): $102.
How To Use/How To Get Help: For technical assistance, send questions to: entrez@ncbi.nlm.nih.gov

Service: GenBank (GPO list ID: NCBIF)
Purpose: GenBank in "flat-file" format, as used by some commercial and academic software. Annual subscription (6 issues, 2 discs/issue): $66.
How To Use/How To Get Help: Send e-mail to: info@ncbi.nlm.nih.gov
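The anonymous FTP login described in the list above can also be scripted. The following Python sketch uses the standard library's ftplib module. The helper names are hypothetical, and the host is the one named in this newsletter, so it is an assumption that it still answers at that address.

```python
from ftplib import FTP
from typing import List, Tuple

# Host name as given in this newsletter; it may have changed since.
FTP_HOST = "ncbi.nlm.nih.gov"


def anonymous_credentials(email: str) -> Tuple[str, str]:
    """Return the login name and password for anonymous FTP.

    By convention, the login is "anonymous" and the password is the
    user's e-mail address.
    """
    if "@" not in email:
        raise ValueError("the password should be your e-mail address")
    return ("anonymous", email)


def list_directory(host: str, directory: str, email: str,
                   timeout: float = 30.0) -> List[str]:
    """Log in anonymously and return the file names in `directory`."""
    user, password = anonymous_credentials(email)
    with FTP(host, timeout=timeout) as ftp:
        ftp.login(user=user, passwd=password)
        return ftp.nlst(directory)
```

A call such as list_directory(FTP_HOST, "pub/mmdb", "you@example.org") would need a working network connection, and both the host and the directory are taken from this issue rather than verified.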

Masthead

NCBI News is distributed three times a year. We welcome communication from users of NCBI databases and software and invite suggestions for articles in future issues. Send correspondence and suggestions to NCBI News at the address below.

NCBI News
National Library of Medicine
Bldg. 38A, Room 8N-803
8600 Rockville Pike
Bethesda, MD 20894
Phone: (301) 496-2475
Fax: (301) 480-9241
E-mail: info@ncbi.nlm.nih.gov

Editors

Dennis Benson
Barbara Rapp

Design Consultant

Troy M. Hill

Photography

Karlton Jackson

Editing, Graphics, and Production

Veronica Johnson
Wendy B. Osborne

In 1988, Congress established the National Center for Biotechnology Information as part of the National Library of Medicine; its charge is to create automated systems for storing molecular biology, biochemistry, and genetics data, and to perform research in computational molecular biology.

The contents of this newsletter may be reprinted without permission. The mention of trade names, commercial products, or organizations does not imply endorsement by NCBI, NIH, or the U.S. Government.
NIH Publication No. 95-3272
ISSN 1060-8788
