NCERT grounding
Section 5.9 of the NCERT Class XII Biology chapter Molecular Basis of Inheritance opens the Human Genome Project (HGP) by recalling a single premise from the earlier sections: it is the sequence of bases in DNA that determines the genetic information of an organism. If two individuals differ, their DNA sequences must differ somewhere. NCERT states that this assumption — combined with genetic engineering techniques to isolate and clone any piece of DNA, and fast methods to read DNA sequences — led to the launch of "a very ambitious project of sequencing human genome" in the year 1990.
The textbook explicitly calls HGP a mega project and lists its goals, its 13-year duration, the coordinating bodies, the two methodologies, the salient features of the genome, and its applications. Every figure quoted on this page (3164.7 million base pairs, ~30,000 genes, 99.9% identity, 1.4 million SNP locations) is taken verbatim from NCERT section 5.9.1, so none of it should be rounded or rephrased in an exam answer.
"Human genome is said to have approximately 3 × 10⁹ bp, and if the cost of sequencing required is US $3 per bp… the total estimated cost of the project would be approximately 9 billion US dollars."
— NCERT, Molecular Basis of Inheritance, Section 5.9
Inside the Human Genome Project
The Human Genome Project was conceived as an attempt to sequence every base in the human genome. NCERT frames it as a mega project for three concrete reasons: its scale, its cost, and its data burden. The human genome contains approximately 3 × 10⁹ base pairs. At the early estimated cost of US $3 per base pair, the total projected cost came to roughly 9 billion US dollars. If the obtained sequence were printed in books (1000 letters per page, 1000 pages per book), about 3300 books would be needed just to store the DNA sequence of a single human cell. That volume of data made high-speed computational devices for storage, retrieval and analysis unavoidable, and so HGP was closely associated with the rapid growth of a new field, bioinformatics.
About 3300 books to store one cell's sequence
With 1000 letters per page and 1000 pages per book, the DNA sequence from a single human cell would fill about 3300 such books. This is NCERT's illustration of why HGP demanded bioinformatics.
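The cost and storage figures above are simple arithmetic, and a short sketch makes the steps explicit. The constant and function names here are illustrative only. One assumption is flagged in the code: the rounded 3 × 10⁹ bp gives 3000 books, so the 3300-book figure appears to correspond to a genome size of 3.3 × 10⁹ bp; that reading is an inference, not something the text states.

```python
# Figures quoted in the text above.
GENOME_BP = 3 * 10**9        # approximate genome size used for the cost estimate
COST_PER_BP_USD = 3          # early estimated sequencing cost per base pair
LETTERS_PER_PAGE = 1000
PAGES_PER_BOOK = 1000


def project_cost_usd(genome_bp: int, cost_per_bp: int) -> int:
    """Total cost if every base pair costs the same to sequence."""
    return genome_bp * cost_per_bp


def books_needed(genome_bp: int) -> float:
    """Books required when each base is printed as one letter."""
    letters_per_book = LETTERS_PER_PAGE * PAGES_PER_BOOK
    return genome_bp / letters_per_book


total_cost = project_cost_usd(GENOME_BP, COST_PER_BP_USD)
books_rounded = books_needed(GENOME_BP)

# Assumption (not stated in the text): the 3300-book figure matches a
# genome size of 3.3 x 10^9 bp rather than the rounded 3 x 10^9 bp.
ASSUMED_BP_FOR_BOOKS = 33 * 10**8
books_assumed = books_needed(ASSUMED_BP_FOR_BOOKS)

print(total_cost)      # 9000000000, i.e. 9 billion US dollars
print(books_rounded)   # 3000.0
print(books_assumed)   # 3300.0
```

For an exam answer, the figures to quote remain NCERT's own: about 9 billion US dollars and about 3300 books.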
Launch, coordination and timeline
HGP was a 13-year project launched in 1990 and completed in 2003. It was coordinated by the U.S. Department of Energy and the National Institutes of Health. During the early years the Wellcome Trust (U.K.) became a major partner, and additional contributions came from Japan, France, Germany, China and others. The project was not limited to humans: many non-human model organisms were also sequenced, including bacteria, yeast, Caenorhabditis elegans (a free-living non-pathogenic nematode), Drosophila (the fruit fly), and plants such as rice and Arabidopsis.
HGP timeline: launch to last chromosome
- 1990: Project launched. Sequencing of the human genome begins as a mega project.
- 1990 to 2003 (13 years): Coordinated multi-nation effort. U.S. DoE and NIH coordinate; Wellcome Trust and others join.
- 2003: Project completed. Sequencing essentially finished after 13 years.
- May 2006: Chromosome 1 done. The final piece: last of the 24 chromosomes (22 autosomes + X, Y) sequenced.
One detail worth noting carefully: although the project was completed in 2003, the sequence of chromosome 1 was finished only in May 2006. NCERT records that chromosome 1 was the last of the 24 human chromosomes — 22 autosomes plus X and Y — to be sequenced. Students often assume "completed in 2003" means every chromosome was done by then; the chromosome 1 detail shows the finishing touches extended beyond the headline year.
Goals of HGP
NCERT lists six important goals, grouped below under four headings (the last two headings each combine two goals). They are best remembered as a grid rather than a prose list, because NEET can probe any single item, especially the gene-count goal and the ELSI goal.
Exam note: the goals quote a gene target of 20,000–25,000 genes; the salient feature later in the chapter states the observed estimate as ~30,000 genes. NCERT uses both figures — quote the one the question asks for.
Identify all genes
Identify the approximately 20,000–25,000 genes in human DNA.
Sequence the base pairs
Determine the sequence of the 3 billion base pairs making up human DNA.
Store & analyse data
Store the information in databases and improve tools for data analysis.
Transfer & address ELSI
Transfer technologies to industry; address ethical, legal and social issues (ELSI).
How the genome was actually sequenced
DNA is an extremely long polymer, and there are technical limits to reading very long pieces. So the working protocol was a fragment-and-reassemble approach. Total DNA from a cell was isolated and converted into random fragments of relatively smaller sizes. These fragments were cloned in a suitable host using specialised vectors, which amplified each fragment so it could be sequenced with ease. The commonly used hosts were bacteria and yeast; the vectors were BAC (bacterial artificial chromosomes) and YAC (yeast artificial chromosomes).
The fragments were then read by automated DNA sequencers that worked on the principle of a method developed by Frederick Sanger, who is also credited with developing a method for determining amino acid sequences in proteins. The individual sequences were arranged using the overlapping regions present in them, which required deliberately generating overlapping fragments. Aligning these sequences by hand was not humanly possible, so specialised computer programs were developed. The aligned sequences were then annotated and assigned to each chromosome. A further task was assigning genetic and physical maps to the genome; these were generated using information on polymorphism of restriction endonuclease recognition sites and on repetitive DNA sequences known as microsatellites.
Figure 1. The sequencing workflow — total DNA is broken into random fragments, cloned and amplified in BAC/YAC vectors, read by automated Sanger sequencers, and finally stitched together by computer programs that exploit overlapping regions before each stretch is annotated to a chromosome.
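The idea of arranging fragments by their overlapping regions can be illustrated with a toy program. The sketch below uses a simple greedy merge (repeatedly joining the two fragments with the longest suffix-prefix overlap), which is a teaching simplification chosen for this example and not the software HGP actually used. The function names, the `min_overlap` parameter and the short made-up fragments are all hypothetical, and the sketch ignores sequencing errors, reverse strands and repeated sequences.

```python
def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    max_len = min(len(left), len(right))
    for size in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(fragments: list[str], min_overlap: int = 3) -> list[str]:
    """Merge fragments pairwise by best overlap; return the remaining contigs."""
    reads = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_size, best_i, best_j = 0, -1, -1
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    size = overlap_length(a, b, min_overlap)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # nothing overlaps any more; leave the pieces separate
        merged = reads[best_i] + reads[best_j][best_size:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Three made-up overlapping fragments, supplied out of order.
toy_fragments = ["ACGTTAGC", "ATGGCGTA", "GCGTACGT"]
print(greedy_assemble(toy_fragments))  # ['ATGGCGTACGTTAGC']
```

Even on this tiny input the program has to compare every fragment against every other one, which hints at why aligning millions of real fragments needed dedicated computer programs.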
Two sequencing strategies
NCERT states that the methods involved two major approaches. This comparison is a frequent NEET target, so the distinction must be sharp. One approach was gene-focused; the other was genome-wide and "blind".
The first approach focused on identifying all the genes that are expressed as RNA. These were referred to as Expressed Sequence Tags (ESTs). By concentrating only on transcribed sequences, this strategy went straight for the genes and ignored the vast non-coding portion of the genome. The second approach took the "blind" route of simply sequencing the whole set of the genome — all the coding and non-coding sequence — and only later assigning functions to different regions of the sequence. NCERT calls this later step Sequence Annotation.
Expressed Sequence Tags (ESTs)
Gene-first
Targets expressed DNA
- Focuses on identifying all genes expressed as RNA
- Ignores the non-coding portion of the genome
- Efficient route straight to the functional genes
- NCERT: "identifying all the genes that are expressed as RNA"
Sequence Annotation
Genome-first
Blind whole-genome approach
- Sequences the whole genome — coding and non-coding
- Functions assigned to regions after sequencing
- A "blind" approach — sequence first, interpret later
- NCERT: "later assigning different regions… with functions"
A clean way to hold the contrast: ESTs ask "which sequences are genes?" and look only there, whereas Sequence Annotation says "read everything, then ask what each part does." The EST approach is the one NEET most often tests, usually phrased as "ESTs refers to…" with the correct answer being genes expressed as RNA.
Salient features of the human genome
NCERT section 5.9.1 lists the salient observations drawn from HGP. These numbers are the single most heavily tested part of this subtopic, so they should be memorised exactly as stated — not rounded. The two anchor figures are the genome size and the gene count.
3164.7 million base pairs
The human genome contains 3164.7 million bp, about 3.1 billion.
~30,000 estimated total genes
Far lower than earlier estimates of 80,000 to 1,40,000 genes.
The average gene consists of about 3000 bases, but sizes vary greatly: the largest known human gene, dystrophin, has 2.4 million bases. One of the most striking findings was that almost all — 99.9 per cent — of nucleotide bases are exactly the same in all humans. The functions are unknown for over 50 per cent of the discovered genes, and less than 2 per cent of the genome codes for proteins. The rest is dominated by repeated sequences.
Figure 2. Key salient-feature numbers at a glance: less than 2% of the genome codes for protein, 99.9% of bases are identical across all people, and the number of genes per chromosome ranges from 2968 on chromosome 1 down to just 231 on the Y chromosome.
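To get a feel for what these percentages and extremes mean in absolute terms, the figures can be put through a few lines of arithmetic. This is only an illustrative sketch: the variable names are made up, and the derived values (an upper bound on coding DNA, and two ratios) are calculations from the NCERT numbers, not figures NCERT itself quotes.

```python
# Figures quoted from the salient features.
GENOME_BP = 3_164_700_000        # 3164.7 million bp
AVERAGE_GENE_BASES = 3_000
DYSTROPHIN_BASES = 2_400_000     # largest known human gene
GENES_CHR1 = 2968                # chromosome with the most genes
GENES_CHR_Y = 231                # chromosome with the fewest genes

# "Less than 2 per cent codes for proteins" gives an upper bound in bp.
coding_upper_bound_bp = GENOME_BP * 2 // 100

# How many times larger dystrophin is than the average gene.
dystrophin_vs_average = DYSTROPHIN_BASES // AVERAGE_GENE_BASES

# Ratio of gene counts between the two extreme chromosomes.
chr1_vs_y = GENES_CHR1 / GENES_CHR_Y

print(coding_upper_bound_bp)     # 63294000
print(dystrophin_vs_average)     # 800
print(f"{chr1_vs_y:.1f}")        # 12.8
```

So less than about 63 million bp code for protein, dystrophin is about 800 times the average gene size, and chromosome 1 carries roughly 13 times as many genes as the Y chromosome.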
Repetitive sequences and SNPs
Repeated sequences make up a very large portion of the human genome. These are stretches of DNA repeated many times, sometimes a hundred to a thousand times. NCERT notes they are thought to have no direct coding function, but they shed light on chromosome structure, dynamics and evolution. On the number of genes per chromosome, the textbook gives the two extremes explicitly: chromosome 1 has the most genes (2968), and the Y chromosome has the fewest (231).
Finally, scientists identified about 1.4 million locations where single-base DNA differences — SNPs (single nucleotide polymorphisms, pronounced 'snips') — occur in humans. NCERT states that this information promises to revolutionise the finding of chromosomal locations for disease-associated sequences, and the tracing of human history. Together with whole-genome sequencing and high-throughput technology, SNPs let researchers move from studying one or a few genes at a time to studying all the genes, transcripts and proteins of a tissue or tumour as interconnected networks.
Almost all — 99.9 per cent — of nucleotide bases are exactly the same in all humans; it is the remaining fraction that makes each individual unique.
NCERT — Salient Features of Human Genome
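The 99.9 per cent and 1.4 million figures can also be turned into rough absolute numbers. The sketch below is a back-of-the-envelope calculation with hypothetical variable names. The mean-spacing value assumes SNP locations are spread evenly along the genome, which is a simplification made only for this example and is not a claim from NCERT.

```python
GENOME_BP = 3_164_700_000     # 3164.7 million bp
SNP_LOCATIONS = 1_400_000     # about 1.4 million SNP locations

# 99.9% identical means 0.1% (one base in a thousand) can differ.
differing_bp = GENOME_BP // 1000

# Simplifying assumption: SNP locations are evenly spaced along the genome.
mean_bp_per_snp = GENOME_BP / SNP_LOCATIONS

print(differing_bp)       # 3164700
print(mean_bp_per_snp)    # 2260.5
```

In other words, the 0.1 per cent that varies still amounts to about 3 million bases, and under the even-spacing assumption there is roughly one catalogued SNP location every 2,260 bp.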
Worked examples
According to NCERT's salient features of the human genome, what is the total number of base pairs, and how does the estimated gene count compare with earlier estimates?
The human genome contains 3164.7 million base pairs (about 3.1 billion). The total number of genes is estimated at about 30,000, which is much lower than the previous estimates of 80,000 to 1,40,000 genes. The surprise of HGP was that humans turned out to have far fewer genes than expected.
Distinguish between the two methodological approaches of the Human Genome Project.
The first approach identified all the genes that are expressed as RNA — these are the Expressed Sequence Tags (ESTs), a gene-focused strategy. The second was the blind approach of sequencing the whole genome, both coding and non-coding sequence, and only afterwards assigning functions to different regions — a step called Sequence Annotation.
A question states the average human gene is "about 3000 bases" yet the dystrophin gene has 2.4 million bases. Is there a contradiction?
No contradiction. NCERT explicitly says the average gene consists of about 3000 bases, "but sizes vary greatly." The 3000-base figure is a mean; individual genes range widely, and dystrophin is given as the largest known human gene at 2.4 million bases. The average and the extreme are both correct.
Why is HGP called a mega project? Give the cost and storage figures NCERT uses.
HGP aimed to sequence every base in a genome of about 3 × 10⁹ bp. At an early estimate of US $3 per base pair, the total cost came to roughly 9 billion US dollars. The sequence of a single cell, printed at 1000 letters per page and 1000 pages per book, would fill about 3300 books. Its scale, cost, 13-year duration and data burden together justify the label "mega project".
Common confusion & NEET traps
Most errors on this subtopic come from mixing up two near-identical figures, or from confusing the two methodologies. The callouts below isolate the traps NEET sets most often.
Genome size (HGP salient feature)
3164.7 million bp
~3.1 × 10⁹ bp
- The precise NCERT figure for the human genome
- Quoted in the salient-features list, section 5.9.1
- Approximated as "3 × 10⁹ bp" in the goals
Diploid DNA content (packaging section)
6.6 × 10⁹ bp
Diploid mammalian cell
- From the DNA-packaging section, not from HGP
- Haploid content is 3.3 × 10⁹ bp
- Used to calculate the ~2.2 m DNA length
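The ~2.2 m length in the last point follows from a one-line calculation, sketched below. The distance of 0.34 nm between consecutive base pairs is the value used in the DNA-packaging section and is not stated on this page, so it is marked as an assumption in the code; the variable names are illustrative.

```python
DIPLOID_BP = 66 * 10**8          # 6.6 x 10^9 bp in a diploid mammalian cell
HAPLOID_BP = DIPLOID_BP // 2     # 3.3 x 10^9 bp

# Assumption: 0.34 nm between consecutive base pairs (value from the
# DNA-packaging section, not from the HGP section).
RISE_PER_BP_NM = 0.34
NM_PER_M = 10**9

length_m = DIPLOID_BP * RISE_PER_BP_NM / NM_PER_M

print(HAPLOID_BP)             # 3300000000
print(f"{length_m:.2f} m")    # 2.24 m
```

Keeping this calculation tied to 6.6 × 10⁹ bp helps avoid the trap above: the 3164.7 million bp HGP figure belongs to the salient features, not to the DNA-length calculation.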