Chronic Hepatitis C: Interpreting Its Defective Gene and Designing Primers Using Bioinformatics Tools
Chronic Hepatitis C Disease
Introduction
Hepatitis C
Hepatitis C is an infection of the liver that results from the Hepatitis C virus. Acute Hepatitis C refers to the first several months after someone is infected. Acute infection can range in severity from a very mild illness with few or no symptoms to a serious condition requiring hospitalization. For reasons that are not known, about 20% of people are able to clear, or get rid of, the virus without treatment in the first 6 months. Infection that persists beyond this period is chronic Hepatitis C.
Chronic Hepatitis C virus
Hepatitis C virus infection is one of the main causes of chronic liver disease worldwide. The long-term natural history of HCV infection is highly variable. The hepatic injury can range from minimal histological changes to extensive fibrosis and cirrhosis with or without hepatocellular carcinoma. There are approximately 71 million chronically infected individuals worldwide, many of whom are unaware of their infection, with important variations according to the geographical area. Clinical care for patients with HCV-related liver disease has advanced considerably during the last two decades, thanks to an enhanced understanding of the pathophysiology of the disease, and because of developments in diagnostic procedures and improvements in therapy and prevention.
Gene
NFE2L2:
This gene encodes a transcription factor which is a member of a small family of basic leucine zipper (bZIP) proteins. The encoded transcription factor regulates genes that contain antioxidant response elements (ARE) in their promoters; many of these genes encode proteins involved in the response to injury and inflammation, which includes the production of free radicals. Multiple transcript variants encoding different isoforms have been characterized for this gene.
Protein mutation
The disease is caused by mutations affecting the gene represented in this entry. Immunodeficiency, developmental delay, and hypohomocysteinemia (IMDDHH) is an early-onset multisystem disorder characterized by immunodeficiency, recurrent infections, developmental delay, poor growth, intellectual disability, and hypohomocysteinemia. Some patients manifest congenital cardiac defects. The IMDDHH inheritance pattern is autosomal dominant.
2. Identification of Mutations in Nucleotide and Protein Sequence
· Normal protein sequence
· Mutated protein sequence
· Normal nucleotide sequence
· Mutated nucleotide sequence
Annotation
| Disease | Gene | Protein | Associated Proteins | Mutation |
| Chronic Hepatitis Disease | NFE2L2 | Nuclear factor erythroid 2-related factor 2 (Nrf2) | MAFG, MAFF, MAFK, JUN, HMOX1, HMOX2, PRKCA, KEAP1, CRYZ, GSTA2 | Missense G81S |
Tool 1: CLUSTALW
CLUSTALW is a multiple sequence alignment tool that is also used to find the sequence similarity between multiple sequences.
Input formats: FASTA, Pearson, PIR, EMBL, GDE, and GCG.
Output formats: CLUSTAL, GDE, Phylip, etc.
Pairwise alignment parameters: gap penalty 3, number of top diagonals 5, scoring method percent.
Multiple alignment parameters: gap open penalty 10, gap extension penalty 0.1, weight matrix BLOSUM for protein and IUB for DNA. [4]
Working:
1. First, I retrieved the sequence of the NFE2L2 gene from NCBI in FASTA format.
2. I added the normal and mutated sequences of NFE2L2 to the sequence alignment box. In this tool, you can add multiple sequences to a single alignment box.
3. Then I clicked "Execute Multiple Alignment".
Result:
In my case, CLUSTALW did not give the desired results. It showed an error because the normal and mutated sequences both carry the same accession number and differ by only one nucleotide. CLUSTALW expects the input sequences to have different accession numbers, which is why I did not get the desired results with this tool.
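One possible workaround for the shared accession number is to give the two records distinct FASTA headers before pasting them into the alignment box. The sketch below shows the idea in Python. The toy sequence, the record names `NFE2L2_normal` and `NFE2L2_mutated`, and both helper functions are illustrative assumptions and are not taken from the NCBI record. Whether this alone resolves the error on the web server was not tested here.

```python
def point_substitute(seq: str, position: int, new_base: str) -> str:
    """Return seq with the base at 1-based `position` replaced by `new_base`."""
    if not 1 <= position <= len(seq):
        raise ValueError("position is outside the sequence")
    return seq[:position - 1] + new_base + seq[position:]


def to_fasta(record_id: str, seq: str, width: int = 60) -> str:
    """Format one FASTA record, wrapping the sequence at `width` characters."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + record_id + "\n" + "\n".join(lines) + "\n"


# Toy sequence standing in for the real NFE2L2 record (hypothetical data).
normal = "ATGGGTACCGGA"
mutated = point_substitute(normal, 4, "A")  # G -> A at position 4

# Hypothetical, distinct identifiers so the two records can be told apart.
fasta_text = to_fasta("NFE2L2_normal", normal) + to_fasta("NFE2L2_mutated", mutated)
print(fasta_text)
```

The printed text contains two records whose header lines differ, so it can be pasted into an alignment box as two separately named sequences.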
Tool 2: Clustal Omega
Clustal Omega is another tool that is used to align multiple sequences and report their sequence similarity. It is also used to check the evolutionary relationship between species. This tool can align up to about 4000 sequences. You can align protein, DNA, or RNA sequences. [5]
Working:
1. First, I retrieved the sequence from NCBI in FASTA format.
2. I entered the input sequences and selected DNA.
3. I set my parameters, with the output format CLUSTALW with character counts.
4. Then I submitted the job.
Result:
I did not obtain the desired results from this tool because it requires sequences with different accession numbers. Since I inserted the mutation into the normal nucleotide sequence, both sequences have the same accession number. That is why this tool did not give the desired results.
Tool 3: T-COFFEE
T-Coffee is a multiple sequence alignment tool that is used to check the sequence similarity of multiple sequences at a time. T-Coffee can align protein, RNA, and DNA sequences. In T-Coffee, the sequence length cannot exceed 2500 bp.
Input format: FASTA
Output format: HTML [6].
· In the T-Coffee tool, the input sequences are added in FASTA format. Paste the multiple sequences whose similarity you want to check into the "sequences to align" box and then run the program.
· The alignment is shown in different colors that indicate how well each region of the sequences is matched. Pink shows a good match, yellow shows an average match, and green shows a bad match or unmatched sequence [6].
· I added the normal and mutated sequences of the NFE2L2 gene in FASTA format to the sequence alignment box and submitted my work to get the result.
Result:
The sequence that I added is 2,988 bp long, and a sequence of more than 2500 bp cannot be submitted, so I did not get the desired results.
Tool 4: BLAST
BLAST stands for Basic Local Alignment Search Tool. It is used to find local sequence similarity between sequences of the same gene, protein, or nucleotide sequence in different species, and it reports how much of the sequence is matched or mismatched and where gaps are present. This tool is also used to identify evolutionary relationships and helps to find gene families.
· Input formats: FASTA or GenBank, together with a weight matrix.
· Output formats: HTML, plain text, and XML [7].
1. First, add the query sequence that you want information about.
2. Select the "Align two or more sequences" option.
3. Add the sequence that you want to compare in the subject box.
4. If you want to check for similarity, select "Highly similar sequences".
5. If you want to check more dissimilar sequences, select discontiguous megablast, and so on.
6. Click the BLAST button so that the system runs the program.
7. The results tell how similar the sequences are: the percentage identity, the total number of identities, the query coverage, the E value, and the numbers of gaps, matches, and mismatches.
8. There are different alignment views, such as pairwise, pairwise with dot identities, query-anchored with dot identities, etc. You can also draw a dot plot with the help of BLAST.
In my case, I added the normal NFE2L2 sequence in the query box and the mutated sequence in the subject box, selected highly similar sequences, and clicked the BLAST button [8].
The total length of the sequence is 2988 bases, with 99 percent identity, 0 gaps, and 1 mismatch. The mismatch is at position 241 in the sequence, where G is replaced with A. The alignment view that I used is pairwise with dot identities [8].
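The single-mismatch result can be cross-checked without an alignment tool, because the two sequences have the same length and contain no gaps. The following is a minimal Python sketch of such a check. It is not how BLAST works internally; the function name `compare_sequences` and the toy sequences are assumptions made for illustration.

```python
def compare_sequences(query: str, subject: str):
    """Compare two equal-length, gap-free sequences position by position.

    Returns a list of (1-based position, query base, subject base) for each
    mismatch, and the percent identity.
    """
    if not query or len(query) != len(subject):
        raise ValueError("this sketch only handles non-empty, equal-length sequences")
    mismatches = [
        (i, q, s)
        for i, (q, s) in enumerate(zip(query.upper(), subject.upper()), start=1)
        if q != s
    ]
    identity = 100.0 * (len(query) - len(mismatches)) / len(query)
    return mismatches, identity


# Toy data (hypothetical): one G -> A substitution at position 4.
query = "ATGGGTACCGGA"
subject = "ATGAGTACCGGA"

mismatches, identity = compare_sequences(query, subject)
for pos, q, s in mismatches:
    print(f"position {pos}: {q} -> {s}")
print(f"identity: {identity:.2f}%")
```

Applied to the real pair of 2988-base sequences, one mismatch corresponds to 2987/2988, or about 99.97% identity, which is consistent with the 99 percent reported above.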
Description:
· Max score is the score of the highest-scoring alignment between the query sequence and the subject sequence.
· The E value reflects background noise. It describes the number of hits one can expect to see by chance when searching a database of a particular size.
· The percentage identity of the sequence was 99%, so there is a difference of less than 1% between the query and subject sequences.
· The accession number is a unique identifier that is given to a biological sequence.
In the graphic summary, the color key at the top runs from red (highest alignment score) to black (lowest alignment score). The horizontal red line represents the query sequence.
· Red bar shows the most similar sequences.
· Pink bar shows a less good match.
· Green bar shows a match that is not impressive.
· Blue bar shows the worst score.
· Black bar shows bad hits.
The alignment score is more than 200, which means the query sequence is highly similar to the subject sequence. However, the identity is not 100% because one mismatch is present.
There are no gaps in the alignment, so the dot plot is a straight line, which shows maximum similarity.
The University of California Santa Cruz (UCSC) Genome Browser is a popular Web-based tool for quickly displaying a requested portion of a genome at any scale, accompanied by a series of aligned annotation "tracks". The annotations generated by the UCSC Genome Bioinformatics Group and external collaborators display gene predictions, mRNA and expressed sequence tag alignments, simple nucleotide polymorphisms, expression and regulatory data, phenotype and variation data, and pairwise and multiple-species comparative genomics data. All information relevant to a region is presented in one window, which makes biological analysis and interpretation easier. The database tables underlying the Genome Browser tracks can be viewed, downloaded, and manipulated using another Web-based application, the UCSC Table Browser. Users can upload data as custom annotation tracks in both browsers for research or educational use. The cited unit describes how to use the Genome Browser and Table Browser for genome analysis, download the underlying database tables, and create and display custom annotation tracks [9].
Steps for using the genome browser
The following steps are involved in working with the UCSC Genome Browser:
i. First, open the browser, click on Genome Browser, and visualize the genome data.
ii. Then click on Asia or another option, depending on your location.
iii. Select the latest assembly version and enter the position, gene symbol, or other search term that you want to search for.
iv. For example, I entered the NFE2L2 gene for human and clicked the search button.
v. The result shows a graphical summary of the NFE2L2 gene, in which every track shows an annotation of the gene.
The result shows that the NFE2L2 gene is located on chromosome 2 in the q31.2 region.
This region shows the introns and exons; NFE2L2 contains 5 exons and 4 introns.
The results show the expression of the gene in different tissues and organs, including where its expression is highest. In this case, the highest expression occurs in Esophagus - Mucosa.
The different colors show the presence of regulatory elements in the gene.
This result shows the regions of similarity. The thick regions show that the gene is present in different organisms. For example, the highest thickness is present in Rh
The regions marked by lines are single nucleotide polymorphisms (SNPs) in this gene. NFE2L2 contains 151 SNPs.
Phylogenetic trees are constructed to show the evolutionary relationships between different organisms. A phylogenetic tree may be rooted or unrooted. A rooted tree identifies the common ancestor.
A phylogenetic tree contains:
· OTUs (operational taxonomic units) at the tips, and internal (hypothetical) nodes.
· Internal or external branch lengths.
· Clades
· Tree topology
· Outgroup (a distantly related taxon used to root the tree)
· Scaled tree (branch lengths are proportional to the amount of evolutionary change)
· Unscaled tree (branch lengths are not proportional to change; only the branching order is informative)
· Orthologous genes (the same gene in different organisms)
· Paralogous genes (genes arising from duplication within the same organism)
I took the NFE2L2 gene, collected its sequence from 50 different species, and constructed the phylogenetic tree with MEGA X software.
· There are about 20 clades in the phylogenetic tree. In clade 1, Macaca nemestrina is closely related to Theropithecus gelada; both of these species contain isoform 2 of the NFE2L2 gene.
· Macaca fascicularis is more closely related to Macaca nemestrina and Theropithecus gelada than Chlorocebus sabaeus is.
· Clade 2 contains two species, Pongo abelii and Nomascus leucogenys, which are closely related to each other. Clade 2 is more closely related to clade 1 than to the other clades.
· In clade 3, Piliocolobus tephrosceles is closely related to Colobus angolensis palliatus, and this clade closely resembles clade 2.
· Clade 4 consists of three species, of which Aotus nancymaae and Callithrix jacchus closely resemble each other. Clade 4 is closely related to clade 3.
· Clade 5 contains four species, namely Pan paniscus, Homo sapiens, Gorilla, and Pan troglodytes. Gorilla and Pan troglodytes closely resemble each other, while Homo sapiens shows similarity with Pan paniscus.
· Clade 6 consists of only one species, and this clade is closely related to clade 5.
· Clade 7 contains only two species, which are closely related, and this clade resembles clade 6.
· In clade 8, Orcinus orca is closely related to Monodon monoceros.
· Clade 9 is closely related to clade 10, and both of these clades contain only one species.
· In clade 11, Tursiops truncatus closely resembles Theropithecus gelada.
· Clade 11 is closely related to clade 12, which contains Homo sapiens with a variant of the NFE2L2 gene.
· Clade 13 contains only two closely related species, and this clade is closely related to clade 12.
· In clade 14 there are about four species, of which Acinonyx jubatus is closely related to Lynx canadensis, and Felis catus is the least closely related to these two species.
· In clade 15, Marmota flaviventris is closely related to Urocitellus parryii. In this clade, Pan troglodytes is more closely related to Gorilla than to Marmota flaviventris.
· Clade 15 is closely related to clades 16 and 17.
· In clade 16, Macaca nemestrina is closely related to Theropithecus gelada.
· In clade 17, the two species of the genus Rhinopithecus, Rhinopithecus roxellana and Rhinopithecus bieti, are closely related to each other.
· Clades 16 and 17 closely resemble each other.
· Clade 18 consists of only one species, and this clade closely resembles clades 10 and 11.
· In clade 19, Globicephala melas is closely related to Tursiops truncatus. Nomascus leucogenys, which contains isoform 1 of the NFE2L2 gene, has less resemblance to Tursiops truncatus and Globicephala melas. This clade is closely related to clade 20.
· Clade 20 comprises only two closely related species, Lipotes vexillifer and Muntiacus muntjak.
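Statements such as "closely related" in the clade descriptions above ultimately rest on how much the aligned sequences differ. As a rough illustration only, the Python sketch below computes the p-distance (the proportion of differing sites) for every pair in a small set of aligned sequences. MEGA X offers several distance and tree-building methods and the text does not say which one was used, so this is not a reproduction of that analysis. The species labels and sequences are hypothetical.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Proportion of aligned sites at which two sequences differ."""
    if not a or len(a) != len(b):
        raise ValueError("sequences must be aligned to the same non-zero length")
    diffs = sum(1 for x, y in zip(a, b) if x != y)
    return diffs / len(a)


# Hypothetical aligned sequences; real input would be the aligned NFE2L2 records.
aligned = {
    "species_A": "ATGGGTACCGGA",
    "species_B": "ATGAGTACCGGA",
    "species_C": "ATGAGTACTGGA",
}

# Pairwise distances keyed by (name1, name2).
distances = {
    (n1, n2): p_distance(aligned[n1], aligned[n2])
    for n1, n2 in combinations(aligned, 2)
}

for (n1, n2), d in distances.items():
    print(f"{n1} vs {n2}: {d:.4f}")
```

Pairs with smaller distances would be placed closer together by a distance-based tree method; here species_A and species_C differ at two sites and are therefore further apart than species_A and species_B.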
A primer is a short synthetic oligonucleotide that is used in many molecular techniques, from PCR to DNA sequencing. Primers are designed to have a sequence that is the reverse complement of the region of the template, or target DNA, to which we wish the primer to anneal.
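The reverse complement mentioned above can be computed in a couple of lines. This is a minimal Python sketch for plain A/C/G/T sequences; the function name `reverse_complement` and the toy template region are assumptions for illustration, and ambiguity codes such as N are left unchanged.

```python
# Translation table mapping each base to its complement (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


# Hypothetical region of a template strand, written 5' to 3'.
template_region = "ATGCCGTA"
primer = reverse_complement(template_region)
print(primer)
```

Applying the function twice returns the original sequence, which is a quick sanity check when preparing primer sequences by hand.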
Primer3Plus is a tool that picks primers from a DNA sequence.
· I opened the Primer3Plus tool and pasted my sequence in the box provided.
· After that, I clicked the Pick Primers button.
· I then got the available left and right primers for the selected gene, NFE2L2.
· After that, I opened the Sequence Manipulation Suite to choose the best primer from the DNA sequence, based on the following conditions:
i. Length should be 18-25 bases.
ii. Base composition should be 45-55% GC.
iii. Melting temperature should be 55-80 degrees Celsius.
· I took the DNA sequence of the NFE2L2 gene from NCBI.
· I inserted the sequence in the box provided in the Primer3Plus tool to get primers for the desired sequence, and obtained the following results.
I got five primer pairs for my gene sequence.
Ø In the above sequence, purple shows the left primer and yellow shows the right primer.
The Sequence Manipulation Suite (SMS) is a collection of JavaScript programs for generating, formatting, and analyzing short DNA and protein sequences. It is commonly used by molecular biologists, for teaching, and for program and algorithm testing.
· After getting the left and right primers from the Primer3Plus tool, I opened the Sequence Manipulation Suite to select the best primer for my gene sequence.
· For this purpose, I opened the PCR Primer Stats option in the list on SMS and entered the sequences of all the left primers of my gene sequence.
· I got the following results, which identify the primer with the best properties.
According to the above result, the left primer of pair five is the best primer because it has:
· Primer sequence: CGGTATGCAACAGGACATTG
· Sequence length: 20
· Base counts: G=6; A=6; T=4; C=4; Other=0;
· GC content (%): 50.00
· Molecular weight (Daltons): 6166.08
· nmol/A260: 5.02
· micrograms/A260: 30.94
· Basic Tm (degrees C): 52
· Salt adjusted Tm (degrees C): 47
· Nearest neighbor Tm (degrees C): 62.06
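Some of the values in this list can be recomputed by hand as a sanity check. The Python sketch below counts bases, computes the GC content, and estimates a basic melting temperature for the selected primer. The Tm formula used here (the Wallace rule below 14 bases, and 64.9 + 41 × (G + C − 16.4) / N otherwise) is a commonly quoted approximation; it gives a value close to the basic Tm listed above, but that the Sequence Manipulation Suite uses exactly this formula is an assumption. The function names are illustrative.

```python
def gc_percent(primer: str) -> float:
    """Percentage of G and C bases in the primer."""
    p = primer.upper()
    return 100.0 * (p.count("G") + p.count("C")) / len(p)


def basic_tm(primer: str) -> float:
    """Basic melting temperature estimate in degrees Celsius.

    Assumed formulas: Wallace rule for primers shorter than 14 bases,
    GC-content formula for longer primers.
    """
    p = primer.upper()
    gc = p.count("G") + p.count("C")
    at = p.count("A") + p.count("T")
    if len(p) < 14:
        return 2.0 * at + 4.0 * gc
    return 64.9 + 41.0 * (gc - 16.4) / len(p)


primer = "CGGTATGCAACAGGACATTG"  # left primer of pair five, from the text

length_ok = 18 <= len(primer) <= 25          # length condition from the text
gc_ok = 45.0 <= gc_percent(primer) <= 55.0   # GC condition from the text

print(f"length: {len(primer)} (within 18-25: {length_ok})")
print(f"GC content: {gc_percent(primer):.2f}% (within 45-55%: {gc_ok})")
print(f"basic Tm: {basic_tm(primer):.2f} degrees C")
```

The length and GC content agree with the reported 20 bases and 50.00%. Different Tm formulas give noticeably different values for the same primer, as the three Tm rows above show, so the melting-temperature condition should be checked against the specific estimate that the PCR protocol relies on.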
Protein structure prediction
3.7.1. CFSSP [15]
CFSSP is used to predict the secondary structure of a protein [15].
Normal
Interpretation
In this picture:
Ø red shows helices,
Ø green shows sheets,
Ø blue shows turns, and
Ø yellow shows coils.
2D structure of the NFE2L2 protein contains:
Mutated
In this mutated 2D structure, red, green, blue, and yellow show helices, sheets, turns, and coils respectively.
The 2D structure of the NFE2L2 protein contains 45 coils, 68 turns, 106 alpha helices, and 100 beta sheets.
The total number of H (helix) residues reported for the mutated sequence is 691. Because of the single mutation, the number of alpha helices differs between the two sequences: 106 in the mutated sequence and 112 in the normal one. The coils, turns, and sheets are the same in both cases.
SWISS-MODEL
SWISS-MODEL gives us a 3D model of a protein. For the NFE2L2 protein:
· Number of models: 9. Select the model with the highest sequence similarity; in my case the sequence similarity is 100 percent.
· Templates: 50
· Oligo state: monomer
· Ligands: 0 in this case
If the value of a model is greater than 0.5, it is considered ideal; in my case it lies between 0.5 and 1.
GOR tool:
Normal
Interpretation
This picture shows the secondary structure of the protein: red shows the alpha helix and blue shows the beta sheets.
Mutated
Interpretation
This picture shows the secondary structure of the protein: red shows the alpha helix and blue shows the beta sheets.
Sequence length: 605
GOR:
Alpha helix (Hh): 205 is 33.88%
310 helix (Gg): 0 is 0.00%
Pi helix (Ii): 0 is 0.00%
Beta bridge (Bb): 0 is 0.00%
Extended strand (Ee): 68 is 11.24%
Beta turn (Tt): 0 is 0.00%
Bend region (Ss): 0 is 0.00%
Random coil (Cc): 332 is 54.88%
Ambiguous states (?): 0 is 0.00%
Other states: 0 is 0.00%
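The percentages in the GOR output follow directly from the residue counts and the sequence length. A short Python sketch of that arithmetic is given below; the dictionary layout and variable names are illustrative choices, while the counts are the non-zero ones listed above.

```python
# Residue counts per predicted state, taken from the GOR output above.
counts = {
    "alpha helix": 205,
    "extended strand": 68,
    "random coil": 332,
}
sequence_length = 605

# Every residue is assigned to exactly one state, so the counts add up.
assert sum(counts.values()) == sequence_length

# Percentage of the sequence in each state, rounded to two decimals.
percentages = {
    state: round(100.0 * n / sequence_length, 2)
    for state, n in counts.items()
}

for state, pct in percentages.items():
    print(f"{state}: {counts[state]} residues, {pct:.2f}%")
```

The recomputed values match the reported 33.88%, 11.24%, and 54.88%.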
Modeller
Modeller is software used to build 3D models of a protein. Select the model with the lowest molpdf value, then open the file of that model in the Chimera tool to view the best 3D model of the desired protein; in this case the protein is NFE2L2.
[1] Asma AA Zahidi, J. M. (2017). Chronic Hepatitis C: an optometrist’s perspective. Clinical optometry, 9, 123-131. Retrieved October 5, 2019
[2] Rachel J. Watkins, M. G. (2012). The Role of NFE2L2 in Idiopathic Infantile Nystagmus. Journal of Ophthalmology, 1-7. Retrieved October 5, 2019
[3] National Center for Biotechnology Information. ClinVar; [VCV000263089.1], https://www.ncbi.nlm.nih.gov/clinvar/variation/VCV000263089.1 (accessed Oct. 5, 2019).
[4] ‘Multiple Sequence Alignment - CLUSTALW’. [Online]. Available: https://www.genome.jp/tools-bin/clustalw. [Accessed: 22-Oct-2019].
[5] ‘Clustal Omega < Multiple Sequence Alignment < EMBL-EBI’. [Online]. Available: https://www.ebi.ac.uk/Tools/msa/clustalo/. [Accessed: 22-Oct-2019].
[6] ‘T-COFFEE Multiple Sequence Alignment Server’. [Online]. Available: http://tcoffee.crg.cat/. [Accessed: 22-Oct-2019].
[7] ‘BLAST: Basic Local Alignment Search Tool’. [Online]. Available: https://blast.ncbi.nlm.nih.gov/Blast.cgi. [Accessed: 22-Oct-2019].
[8] ‘NCBI Blast: NM_000321.2 Homo sapiens RB transcriptional’. [Online]. Available: https://blast.ncbi.nlm.nih.gov/Blast.cgi#Query_56969. [Accessed: 22-Oct-2019].
[9] D. Karolchik, A. S. Hinrichs, and W. J. Kent, ‘The UCSC Genome Browser’, in Current Protocols in Bioinformatics, Hoboken, NJ, USA: John Wiley & Sons, Inc., 2009.