Why is the Human Genome Project called a mega project?

The human genome has about 3 × 10⁹ base pairs, and at the starting estimate of US $3 per base pair the projected cost came to roughly 9 billion US dollars. Storing one cell's sequence as typed text would fill about 3300 books of 1000 pages each, at 1000 letters per page. The project ran for 13 years across many countries and demanded high-speed computers, so its scale, cost and duration earn it the label mega project.

What is the difference between Expressed Sequence Tags and Sequence Annotation?

The Expressed Sequence Tags (EST) approach focused only on identifying the genes that are expressed as RNA, ignoring non-coding DNA. The second approach was blind: it sequenced the entire genome, coding and non-coding alike, and only later assigned functions to different regions, a step NCERT calls Sequence Annotation. ESTs are gene-centred; sequence annotation is genome-wide.

What are SNPs and why are they important?

SNPs (single nucleotide polymorphisms, pronounced 'snips') are single-base DNA differences between people. The Human Genome Project identified about 1.4 million such locations in humans. They promise to revolutionise the finding of chromosomal locations for disease-associated sequences and the tracing of human history.

Which human chromosome has the most genes and which has the fewest?

Chromosome 1 has the most genes, with 2968, while the Y chromosome has the fewest, with 231. The sequence of chromosome 1 was completed only in May 2006 — it was the last of the 24 human chromosomes (22 autosomes plus X and Y) to be sequenced.

What were BAC and YAC used for in the Human Genome Project?

For sequencing, total cell DNA was broken into random smaller fragments, and each fragment was cloned in a suitable host using a specialised vector. The vectors were BAC (bacterial artificial chromosomes) and YAC (yeast artificial chromosomes); the hosts were bacteria and yeast. Cloning amplified each fragment so that it could be sequenced with ease.

Human Genome Project — NEET Notes

Q: How many base pairs and genes does the human genome contain according to NCERT?

NCERT states the human genome contains 3164.7 million base pairs. The total number of genes is estimated at about 30,000 — far lower than earlier estimates of 80,000 to 1,40,000 genes. The average gene consists of about 3000 bases, and the largest known human gene, dystrophin, has 2.4 million bases.

NCERT grounding

Section 5.9 of the NCERT Class XII Biology chapter Molecular Basis of Inheritance opens the Human Genome Project (HGP) by recalling a single premise from the earlier sections: it is the sequence of bases in DNA that determines the genetic information of an organism. If two individuals differ, their DNA sequences must differ somewhere. NCERT states that this assumption — combined with genetic engineering techniques to isolate and clone any piece of DNA, and fast methods to read DNA sequences — led to the launch of "a very ambitious project of sequencing human genome" in the year 1990.

The textbook explicitly calls HGP a mega project and lists its goals, its 13-year duration, the coordinating bodies, the two methodologies, the salient features of the genome, and its applications. The key figures quoted on this page (3164.7 million base pairs, about 30,000 genes, 99.9% identity, 1.4 million SNP locations) are taken from NCERT section 5.9.1, so they should not be rounded in an exam answer.

"Human genome is said to have approximately 3 × 10⁹ bp, and if the cost of sequencing required is US $3 per bp… the total estimated cost of the project would be approximately 9 billion US dollars."
— NCERT, Molecular Basis of Inheritance, Section 5.9

Inside the Human Genome Project

The Human Genome Project was conceived as an attempt to sequence every base in the human genome. NCERT frames it as a mega project for three concrete reasons: its scale, its cost, and its data burden. The human genome contains approximately 3 × 10⁹ base pairs. At the early estimated cost of US $3 per base pair, the total projected cost came to roughly 9 billion US dollars. If the obtained sequence were printed in books — 1000 letters per page, 1000 pages per book — about 3300 books would be needed just to store the DNA sequence of a single human cell. That volume of data made high-speed computational devices for storage, retrieval and analysis unavoidable, and so HGP was closely associated with the rapid growth of a new field, bioinformatics.

3300

Books to store one cell's sequence

With 1000 letters per page and 1000 pages per book, the DNA sequence from a single human cell would fill about 3300 such books — NCERT's illustration of why HGP demanded bioinformatics.
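The cost and storage figures above are simple arithmetic, and a short Python sketch makes the inputs explicit. The variable names are illustrative and not taken from NCERT. One assumption is worth flagging: the 3300-book figure follows from a haploid content of 3.3 × 10⁹ bp, whereas the rounded 3 × 10⁹ bp used for the cost estimate would give 3000 books.

```python
# Back-of-envelope check of the "mega project" figures.
COST_PER_BP_USD = 3             # early estimate quoted by NCERT
GENOME_BP_ROUNDED = 3 * 10**9   # "approximately 3 x 10^9 bp"
HAPLOID_BP = 33 * 10**8         # 3.3 x 10^9 bp (assumed basis of the 3300 figure)

LETTERS_PER_PAGE = 1000
PAGES_PER_BOOK = 1000

# Cost: one fixed price per base pair across the whole genome.
total_cost_usd = GENOME_BP_ROUNDED * COST_PER_BP_USD

# Storage: one letter per base, so one book holds a million bases.
letters_per_book = LETTERS_PER_PAGE * PAGES_PER_BOOK
books_needed = HAPLOID_BP // letters_per_book
books_if_rounded = GENOME_BP_ROUNDED // letters_per_book

print(f"Estimated cost: US ${total_cost_usd:,}")
print(f"Books at 3.3e9 bp: {books_needed}")
print(f"Books at 3e9 bp:   {books_if_rounded}")
```

For the exam, the figures to quote remain the ones NCERT gives: 9 billion US dollars and 3300 books.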

Launch, coordination and timeline

HGP was a 13-year project launched in 1990 and completed in 2003. It was coordinated by the U.S. Department of Energy and the National Institutes of Health. During the early years the Wellcome Trust (U.K.) became a major partner, and additional contributions came from Japan, France, Germany, China and others. The project was not limited to humans: many non-human model organisms, including bacteria, yeast, Caenorhabditis elegans (a free-living non-pathogenic nematode), Drosophila (the fruit fly), and plants such as rice and Arabidopsis, were also sequenced.

HGP timeline: launch to last chromosome (NCERT Section 5.9)

1990 (Start): Project launched. Sequencing of the human genome begins as a mega project.
13 years (Multi-nation): Coordinated effort. U.S. DoE and NIH coordinate; Wellcome Trust and others join.
2003 (Completed): Project completed. Sequencing essentially finished after 13 years.
May 2006 (Final piece): Chromosome 1 done. Last of the 24 chromosomes (22 autosomes + X, Y) sequenced.

One detail worth noting carefully: although the project was completed in 2003, the sequence of chromosome 1 was finished only in May 2006. NCERT records that chromosome 1 was the last of the 24 human chromosomes — 22 autosomes plus X and Y — to be sequenced. Students often assume "completed in 2003" means every chromosome was done by then; the chromosome 1 detail shows the finishing touches extended beyond the headline year.

Goals of HGP

NCERT lists six important goals, grouped here into four cards. They are best remembered as a grid rather than a prose list, because NEET can probe any single item, especially the gene-count goal and the ELSI goal.

Exam note: the goals quote a gene target of 20,000–25,000 genes; the salient feature later in the chapter states the observed estimate as ~30,000 genes. NCERT uses both figures — quote the one the question asks for.

Identify all genes

Identify the approximately 20,000–25,000 genes in human DNA.

Sequence the base pairs

Determine the sequence of the 3 billion base pairs making up human DNA.

Store & analyse data

Store the information in databases and improve tools for data analysis.

Transfer & address ELSI

Transfer technologies to industry; address ethical, legal and social issues (ELSI).

How the genome was actually sequenced

DNA is an extremely long polymer, and there are technical limits to reading very long pieces. So the working protocol was a fragment-and-reassemble approach. Total DNA from a cell was isolated and converted into random fragments of relatively smaller sizes. These fragments were cloned in a suitable host using specialised vectors, which amplified each fragment so it could be sequenced with ease. The commonly used hosts were bacteria and yeast; the vectors were BAC (bacterial artificial chromosomes) and YAC (yeast artificial chromosomes).

The fragments were then read by automated DNA sequencers that worked on the principle of a method developed by Frederick Sanger — the same Sanger credited with developing a method for determining amino acid sequences in proteins. The individual sequences were arranged using overlapping regions present in them, which required deliberately generating overlapping fragments. Aligning these sequences was humanly not possible, so specialised computer programs were developed. The aligned sequences were then annotated and assigned to each chromosome. A further task was assigning genetic and physical maps to the genome, generated using polymorphism of restriction endonuclease recognition sites and repetitive DNA sequences known as microsatellites.

Figure 1. The sequencing workflow — total DNA is broken into random fragments, cloned and amplified in BAC/YAC vectors, read by automated Sanger sequencers, and finally stitched together by computer programs that exploit overlapping regions before each stretch is annotated to a chromosome.
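The idea of arranging reads by their overlapping regions can be illustrated with a toy greedy merge. This is only a sketch of the general principle: the function names, the three short fragments and the `min_len` threshold are all hypothetical, and the specialised programs actually used by HGP are not described in NCERT and were certainly far more sophisticated.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no remaining overlaps: leave separate contigs
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Hypothetical overlapping fragments of one short stretch of DNA.
fragments = ["TACGGA", "ATGCGT", "CGTTAC"]
contigs = greedy_assemble(fragments)
print(contigs)
```

With these three fragments the merge recovers a single 12-base sequence. The nested comparison of every read against every other read also hints at why alignment at genome scale was "humanly not possible" and needed computers.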

Two sequencing strategies

NCERT states that the methods involved two major approaches. This comparison is a frequent NEET target, so the distinction must be sharp. One approach was gene-focused; the other was genome-wide and "blind".

The first approach focused on identifying all the genes that are expressed as RNA. These were referred to as Expressed Sequence Tags (ESTs). By concentrating only on transcribed sequences, this strategy went straight for the genes and ignored the vast non-coding portion of the genome. The second approach took the "blind" route of simply sequencing the whole set of the genome — all the coding and non-coding sequence — and only later assigning functions to different regions of the sequence. NCERT calls this later step Sequence Annotation.

Expressed Sequence Tags vs Sequence Annotation

Expressed Sequence Tags (ESTs)

Gene-first

Targets expressed DNA

Focuses on identifying all genes expressed as RNA
Ignores the non-coding portion of the genome
Efficient route straight to the functional genes
NCERT: "identifying all the genes that are expressed as RNA"

Sequence Annotation

Genome-first

Blind whole-genome approach

Sequences the whole genome — coding and non-coding
Functions assigned to regions after sequencing
A "blind" approach — sequence first, interpret later
NCERT: "later assigning different regions… with functions"

A clean way to hold the contrast: ESTs ask "which sequences are genes?" and look only there, whereas Sequence Annotation says "read everything, then ask what each part does." The EST approach is the one NEET most often tests, usually phrased as "ESTs refers to…" with the correct answer being genes expressed as RNA.
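The contrast can also be pictured with a toy model. Everything in the sketch below is hypothetical: the short "genome" string, the two made-up expressed regions and both function names exist only for illustration. In particular, real annotation infers function from the sequence itself rather than from a ready-made list of coordinates.

```python
# Toy genome: two made-up expressed regions separated by non-coding spacers.
gene_a = "ATGGCCTAA"
gene_b = "ATGAAATGA"
genome = "TTTT" + gene_a + "TTTTTT" + gene_b + "TTTT"

# Hypothetical half-open (start, end) coordinates of the expressed regions.
expressed_regions = [(4, 13), (19, 28)]


def est_approach(genome, regions):
    """Gene-first: read only the stretches that are expressed as RNA."""
    return [genome[start:end] for start, end in regions]


def sequence_then_annotate(genome, regions):
    """Genome-first: keep the whole sequence, then label every region."""
    labelled, pos = [], 0
    for start, end in sorted(regions):
        if pos < start:
            labelled.append((pos, start, "non-coding"))
        labelled.append((start, end, "expressed"))
        pos = end
    if pos < len(genome):
        labelled.append((pos, len(genome), "non-coding"))
    return labelled


ests = est_approach(genome, expressed_regions)
annotation = sequence_then_annotate(genome, expressed_regions)
print(ests)
print(annotation)
```

The first function returns only the two expressed stretches, while the second accounts for every base of the toy genome, which is the gene-centred versus genome-wide distinction in miniature.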

Salient features of the human genome

NCERT section 5.9.1 lists the salient observations drawn from HGP. These numbers are the single most heavily tested part of this subtopic, so they should be memorised exactly as stated — not rounded. The two anchor figures are the genome size and the gene count.

3164.7

Million base pairs

The human genome contains 3164.7 million bp, about 3.16 billion.

~30,000

Estimated total genes

Far lower than earlier estimates of 80,000 to 1,40,000 genes.

The average gene consists of about 3000 bases, but sizes vary greatly: the largest known human gene, dystrophin, has 2.4 million bases. One of the most striking findings was that almost all — 99.9 per cent — of nucleotide bases are exactly the same in all humans. The functions are unknown for over 50 per cent of the discovered genes, and less than 2 per cent of the genome codes for proteins. The rest is dominated by repeated sequences.

Figure 2. Key salient-feature numbers at a glance: less than 2% of the genome codes for protein, 99.9% of bases are identical across all people, and the gene count per chromosome ranges from 2968 genes on chromosome 1 down to just 231 on the Y chromosome.
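To get a feel for what these percentages mean in base pairs, the salient-feature numbers can be turned into rough absolute counts. This is a sketch only: it treats 99.9% and 2% as exact values, which NCERT does not claim, and the variable names are illustrative.

```python
GENOME_BP = 3_164_700_000        # 3164.7 million bp (NCERT)
AVERAGE_GENE_BASES = 3000        # average gene size (NCERT)
DYSTROPHIN_BASES = 2_400_000     # largest known human gene (NCERT)

# 99.9% identical means roughly 1 base in 1000 can differ between people.
differing_bp = GENOME_BP // 1000

# "Less than 2%" codes for protein, so this is an upper bound.
coding_upper_bound_bp = GENOME_BP * 2 // 100

# How many average-sized genes would fit inside dystrophin?
dystrophin_vs_average = DYSTROPHIN_BASES // AVERAGE_GENE_BASES

print(f"Bases that can differ between two people: about {differing_bp:,}")
print(f"Protein-coding bases: fewer than {coding_upper_bound_bp:,}")
print(f"Dystrophin is {dystrophin_vs_average} times the average gene size")
```

Even a 0.1% difference therefore corresponds to roughly three million bases, and dystrophin is 800 times the size of the average gene.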

Repetitive sequences and SNPs

Repeated sequences make up a very large portion of the human genome. These are stretches of DNA repeated many times, sometimes a hundred to a thousand times. NCERT notes they are thought to have no direct coding function, but they shed light on chromosome structure, dynamics and evolution. On the number of genes per chromosome, the textbook gives the two extremes explicitly: chromosome 1 has the most genes (2968), and the Y chromosome has the fewest (231).

Finally, scientists identified about 1.4 million locations where single-base DNA differences, called SNPs (single nucleotide polymorphisms, pronounced 'snips'), occur in humans. NCERT states that this information promises to revolutionise the finding of chromosomal locations for disease-associated sequences, and the tracing of human history. More broadly, whole-genome sequences and high-throughput technologies let researchers move from studying one or a few genes at a time to studying all the genes, transcripts and proteins of a tissue or tumour as interconnected networks.
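A single-base difference is easy to picture as a position-by-position comparison of two aligned sequences. The sketch below is a toy under stated assumptions: the two eight-base sequences and the function name are hypothetical, the sequences are assumed to be already aligned and of equal length, and real SNP identification also depends on how often a variant occurs across a population.

```python
def find_single_base_differences(seq_a, seq_b):
    """Return (position, base_in_a, base_in_b) wherever two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (pos, base_a, base_b)
        for pos, (base_a, base_b) in enumerate(zip(seq_a, seq_b))
        if base_a != base_b
    ]


# Hypothetical aligned stretches from two individuals.
person_1 = "ATGCCGTA"
person_2 = "ATGCTGTA"

differences = find_single_base_differences(person_1, person_2)
print(differences)
```

Here the two sequences differ at one position only (index 4, C versus T), which is the kind of location HGP catalogued about 1.4 million times.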

Almost all — 99.9 per cent — of nucleotide bases are exactly the same in all humans; it is the remaining fraction that makes each individual unique.

NCERT — Salient Features of Human Genome

Worked examples

Worked example 1

According to NCERT's salient features of the human genome, what is the total number of base pairs, and how does the estimated gene count compare with earlier estimates?

The human genome contains 3164.7 million base pairs (about 3.16 billion). The total number of genes is estimated at about 30,000, which is much lower than the previous estimates of 80,000 to 1,40,000 genes. The surprise of HGP was that humans turned out to have far fewer genes than expected.

Worked example 2

Distinguish between the two methodological approaches of the Human Genome Project.

The first approach identified all the genes that are expressed as RNA — these are the Expressed Sequence Tags (ESTs), a gene-focused strategy. The second was the blind approach of sequencing the whole genome, both coding and non-coding sequence, and only afterwards assigning functions to different regions — a step called Sequence Annotation.

Worked example 3

A question states the average human gene is "about 3000 bases" yet the dystrophin gene has 2.4 million bases. Is there a contradiction?

No contradiction. NCERT explicitly says the average gene consists of about 3000 bases, "but sizes vary greatly." The 3000-base figure is a mean; individual genes range widely, and dystrophin is given as the largest known human gene at 2.4 million bases. The average and the extreme are both correct.

Worked example 4

Why is HGP called a mega project? Give the cost and storage figures NCERT uses.

HGP aimed to sequence every base in a genome of about 3 × 10⁹ bp. At an early estimate of US $3 per base pair, the total cost came to roughly 9 billion US dollars. The sequence of a single cell, printed at 1000 letters per page and 1000 pages per book, would fill about 3300 books. Its scale, cost, 13-year duration and data burden together justify the label "mega project".

Common confusion & NEET traps

Most errors on this subtopic come from mixing up two near-identical figures, or from confusing the two methodologies. The callouts below isolate the traps NEET sets most often.

Two figures students confuse — genome size vs DNA content

Genome size (HGP salient feature)

3164.7 million bp

~3.16 × 10⁹ bp

The precise NCERT figure for the human genome
Quoted in the salient-features list, section 5.9.1
Approximated as "3 × 10⁹ bp" in the goals

Diploid DNA content (packaging section)

6.6 × 10⁹ bp

Diploid mammalian cell

From the DNA-packaging section, not from HGP
Haploid content is 3.3 × 10⁹ bp
Used to calculate the ~2.2 m DNA length
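The packaging-section figures can be checked with a few lines of arithmetic, which also shows why they must not be confused with the HGP genome size. The sketch assumes a spacing of 0.34 nm between consecutive base pairs; that value comes from NCERT's DNA-packaging discussion and is not stated on this page. The variable names are illustrative.

```python
DIPLOID_BP = 6.6e9        # diploid mammalian cell (packaging section)
BP_SPACING_M = 0.34e-9    # assumed: 0.34 nm between consecutive base pairs
HGP_GENOME_BP = 3164.7e6  # HGP salient feature (section 5.9.1)

# Length of DNA in one diploid cell if laid end to end.
dna_length_m = DIPLOID_BP * BP_SPACING_M

# Haploid content is half the diploid content.
haploid_bp = DIPLOID_BP / 2

print(f"DNA length per diploid cell: about {dna_length_m:.2f} m")
print(f"Haploid content: {haploid_bp:.1e} bp")
print(f"HGP genome size: {HGP_GENOME_BP:.4e} bp")
```

The length works out to about 2.2 m, and the haploid 3.3 × 10⁹ bp is close to, but not the same as, the 3164.7 million bp reported by HGP.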
