SHMTool
SHMTool receives FASTA sequence files in two categories to be compared (CONTROL and CASE, e.g. wild-type and genetically modified). One or more files can be uploaded for each category. Within each file, which may contain many sequences, identical mutations (i.e. same mutation, same site) are treated as a single unique mutation and counted only once (analysis of non-unique counts, where every sequence is considered independent, is available separately). Separate files should be submitted for sequences originating from independent sources (e.g. different mice, different B cell clones from one mouse defined by CDR3 sequence, or clones of tissue culture cells). A single consensus (germline) sequence must also be designated. The user may also specify a subregion S (a potentially non-contiguous subset of sites) of the consensus sequence to be analyzed separately. The complementary subregion S’ (all sites not in S) is also analyzed. The subregion feature will typically be used, for example, to analyze the complementarity-determining regions (CDRs) that form the antigen-binding sites separately from the framework regions that position the CDRs in the variable region, or to exclude known polymorphic sites that would otherwise lead to an overestimate of the mutation frequency. Statistics comparing S and S’ are also generated. Contingency tables F are constructed and chi-squared tests [as implemented by the R function prop.test] are applied.
Comparative analysis of somatic hypermutation datasets with SHMTool.
4) Recommended Browser (problems with Internet Explorer)
Note that preprocessing of sequences (e.g., alignment, vector removal) is not performed by SHMTool.
Because the sequencing reaction often leads to sequences of differing lengths, it may be tempting in these circumstances to count mutations up to the end of each sequence, especially if mutations are rare. However, to ensure correctness in the calculation of mutation frequencies, we require each mutated sequence to span the exact length of the consensus. Otherwise, we are likely to underestimate the unique mutation frequency since not all sites would be represented by all sequences.
This does mean that any sequences shorter or longer than the consensus cannot be used.
To achieve this:
SHMTool requires sequences in FASTA-formatted files. In this format, every sequence is preceded by a line that starts with a ">" symbol followed by the name of the sequence; the sequence itself follows on the next lines (without spaces between nucleotides). For example:
>CONSENSUS
CAGGTCCAACTGCAGCAGGCAGCAGCCCTGGGGCTGAGGTTGTGAAGCCTGGTGAAGCTGTCCTGCAAGGCTTCTGGCT
ACACCTTCACCAGTCAAGAACAAGGCTGGGTGAAGCAGAGGCGCGGACGAGGCCTTGAGTGGATTGGAAGGATTGAGCC
TTACAGTGGTGATACTAAGTACAATGAGAAGTTCAAGAACAAGGCCACACTGACTGTAGACAAACCGTCCAGCACAGCC
TACATGCAGCTCAGCAGCCTGACATCTGATTATTGTGCAAGTCAAGA
>6733_D08_MS-3723475_032.ab1
CAGGTCCAACTGCAGCAGGCAGCAGCCCTGGGGCTGAGGTTGTGAAGCCTGGTGAAGCTGTCCTGCAAGGCTTCTGGCT
ACACCTTCACCAGTCAAGAACAAGGCTGGGTGAAGCAGAGGCGCGGACGAGGCCTTGAGTGGATTGGAAGGATTGAGCC
TTACAGTGGTGATACTAAGTACAATGAGAAGTTCAAGAACAAGGCCACACTGACTGTAGACAAACCGTCCAGCACAGCC
TACATGCAGCTCAGCAGCCTGACATCTGATTATTGTGCAAGTCAAGA
Note that if you are using a word-processor (such as Microsoft Word) to edit your FASTA files, these need to be saved in "Plain Text" format.
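Because every sequence must span the exact length of the consensus, it can help to check a file before uploading it. The following is a minimal sketch of such a check, not part of SHMTool itself; the function names `parse_fasta` and `check_lengths` and the short example sequences are hypothetical.

```python
def parse_fasta(text):
    """Return a dict mapping each sequence label to its sequence (wrapped lines joined)."""
    records = {}
    label = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: everything after ">" is the label.
            label = line[1:].strip()
            records[label] = []
        elif label is not None:
            records[label].append(line.upper())
    return {name: "".join(parts) for name, parts in records.items()}


def check_lengths(records, consensus_label="CONSENSUS"):
    """Return the labels of sequences whose length differs from the consensus."""
    consensus = records[consensus_label]
    return [name for name, seq in records.items()
            if name != consensus_label and len(seq) != len(consensus)]


# Hypothetical toy file: one sequence is too short and would have to be excluded.
example = """>CONSENSUS
CAGGTCCAAC
TGCAGCAGGC
>seq_ok
CAGGTCCAAC
TGCAGCAGGT
>seq_short
CAGGTCCAAC
"""
records = parse_fasta(example)
bad = check_lengths(records)
print(bad)  # ['seq_short']
```

Any label reported by `check_lengths` would need to be removed or re-aligned before the file is submitted.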
SHMTool provides the opportunity to segregate the sequences in subgroups when they have been obtained from different sources (such as different animals, experiments or clonal subgroups), so they can be aligned separately and saved in different FASTA multi-sequence files. Therefore, it is possible to assign several FASTA files to the same category (CONTROL or CASE) by changing the number of input files in the main page. SHMTool generates independent counts for each FASTA file, before pooling the data within each category and generating the overall comparisons of CONTROL and CASE.
If this option is used, a mutation found in more than one alignment is treated as a unique event within each alignment, so it is counted once for every alignment in which it occurs. By contrast, whether a category (CONTROL or CASE) is split across multiple alignments makes no difference when non-unique (total) counts are computed.
Note that you can decide how many independent alignments you want to include in each category (CONTROL or CASE) and that the number of each does not have to be the same for both categories.
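The difference between unique and non-unique counts can be illustrated with a short sketch. It assumes that mutations are simple substitutions relative to the consensus and that per-file unique counts are summed when a category is pooled; the helper names `mutations` and `count_mutations` and the toy sequences are hypothetical.

```python
def mutations(consensus, seq):
    """Return the set of (position, consensus_base, observed_base) substitutions (1-based)."""
    return {(i + 1, c, s) for i, (c, s) in enumerate(zip(consensus, seq)) if c != s}


def count_mutations(consensus, files):
    """files: one list of sequences per FASTA file in the category.

    Returns (unique_count, non_unique_count) pooled over the category.
    """
    unique = 0
    total = 0
    for seqs in files:
        seen = set()  # mutations already observed in this file
        for seq in seqs:
            muts = mutations(consensus, seq)
            total += len(muts)   # non-unique: every sequence counts independently
            seen |= muts         # unique: same mutation at same site counted once per file
        unique += len(seen)
    return unique, total


consensus = "ACGTACGT"
file_1 = ["ACGTACGT", "ATGTACGT", "ATGTACGA"]  # C2T appears twice, T8A once
file_2 = ["ATGTACGT"]                          # C2T again, in an independent file
unique, total = count_mutations(consensus, [file_1, file_2])
print(unique, total)  # 3 4
```

In this toy example the C-to-T change at site 2 contributes once for each file to the unique count, but three times to the non-unique count.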
You may upload FASTA files from your computer into CONTROL and CASE categories.
A single consensus sequence has to be chosen for whole pair-wise comparison of CONTROL versus CASE.
Use the exact label that designates the consensus sequence in the FASTA multi-sequence files (in the example above, this is CONSENSUS).
SHMTool provides the option to analyze one or more subregions within the region of interest separately. This is especially useful if polymorphisms, structural variants or different domains are located within the region under study. If you specify the exact positions or limits of these subregions in the box on the main page, SHMTool will generate the output results independently for each fragment of the region. This step is not necessary when the whole region is to be analyzed.
Now you are ready to run SHMTool. The results will remain on the server for 48 hours to facilitate your access to them, and will then be deleted.
We provide here one example of CONTROL and CASE datasets that can be used to test the main features of SHMTool. There are 3 CONTROL and 4 CASE files in FASTA format, which correspond to Jh2-Jh4 intron data obtained from three wild-type and four Msh6-/- age-matched mice, respectively. These datasets were used in a previously published work (Li et al., 1996) to demonstrate that deficiency in the mismatch repair protein MSH6 is associated with an overall decrease in the mutation frequency, reduced mutations in A:T pairs and increased transitions in C:G pairs. Each file contains a consensus sequence of length 600 bp (named CONSENSUS, GenBank accession number NT_166318.1, nucleotides 25620185-25620784) and 10-35 actual sequences, some of which may be mutated. Running this in vivo example through SHMTool confirms the key results reported previously in Li et al., 1996 (note that there are minor quantitative differences from the reported results, since the more stringent preprocessing required for SHMTool was not used for the original study) and provides a more extensive analysis of the data, supporting the use of SHMTool as a web server for comparative analysis of somatic hypermutation datasets.
example.CASE4.fasta
As 'Consensus Sequence Label' use: CONSENSUS
SHMTool plots histograms of the number of sequences (frequency, y-axis) that contain X mutations (mutations per sequence, x-axis). In the example below, the majority of the sequences are unmutated in both CONTROL and CASE categories, but many sequences contain from 1 to 20 mutations compared to the CONSENSUS.
As shown in the tables below, the raw histogram data may be downloaded in a tab-delimited version and exported to other applications for further analysis or graph representation. The first row indicates the absolute number of mutations detected in a sequence (mutPerSeq). The rest of the rows show the number of sequences, within each FASTA file, that contain the corresponding number of mutations indicated in the first row.
[Tables not reproduced: raw histogram data for the CONTROL and CASE categories.]
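The layout of the tab-delimited histogram data described above (a first row of mutPerSeq values, then one row per FASTA file) can be sketched as follows. This is only an illustration of the layout under stated assumptions: the function name `histogram_table` and the file names are hypothetical, and the exact formatting of SHMTool's download may differ.

```python
from collections import Counter


def histogram_table(per_file_counts):
    """per_file_counts: dict mapping a file name to the mutation count of each of its sequences.

    Returns tab-delimited text: a header row of mutPerSeq values, then one row per file
    giving how many sequences carry that number of mutations.
    """
    max_mut = max((m for counts in per_file_counts.values() for m in counts), default=0)
    bins = range(max_mut + 1)
    rows = ["\t".join(["mutPerSeq"] + [str(k) for k in bins])]
    for name, counts in per_file_counts.items():
        tally = Counter(counts)  # missing bins count as 0
        rows.append("\t".join([name] + [str(tally[k]) for k in bins]))
    return "\n".join(rows)


table_text = histogram_table({
    "CONTROL1.fasta": [0, 0, 1, 3],
    "CONTROL2.fasta": [0, 2, 2],
})
print(table_text)
```

Text in this form can be pasted into a spreadsheet or read by other applications for further plotting.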
The actual list of mutations per sequence may be downloaded in a tab-delimited version. In the example below, 0 mutations are detected in sequences named ‘5311_F02_MS-272791_003.ab1’ from Control group or ‘S409022.AEC-232-M13F.ab1’ from Case group; 1 mutation is detected in sequence named ‘5311_G06_MS-272807_018.ab1’, and so on. Using this option we can easily detect which sequences are mutated and how many mutations they carry.
[Tables not reproduced: list of mutations per sequence for the CONTROL and CASE categories.]
 | number of mutations | number of trials
CONTROL | x | LCT-x
CASE | y | LCA-y
The first column of F simply consists of the number of CONTROL and CASE mutations respectively (e.g., if there are x wild-type and y transgenic mutations, then F1,1=x and F2,1=y). For the second column, the “number of trials” (the test assumes a binomial distribution), L is assigned for both CONTROL (LCT) and CASE (LCA). We consider it correct to set L to the theoretical maximum number of mutations, i.e. the number of (G, C, A or T) sites multiplied by the number of groups (files) in the category. For example, if the consensus sequence contains N C-sites, and there are 3 wild-type and 4 transgenic groups respectively (uploaded as separate files of sequences), then LCT =3N and LCA=4N, and F1,2=(3N)-x and F2,2=(4N)-y.
With the contingency table assigned, a chi-squared test [as implemented by the R function prop.test (Bates et al., 1996)] is applied.
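The construction of F and the test can be sketched in a few lines. SHMTool itself uses the R function prop.test; the Python below is only an approximation written for illustration, assuming a two-sided test on a 2x2 table with one degree of freedom and a Yates continuity correction. The function names and the example counts (x = 30, y = 20, N = 150 sites) are hypothetical.

```python
import math


def contingency_table(x, y, sites, control_files, case_files):
    """Build F as [[x, L_CT - x], [y, L_CA - y]], with L = sites * number of files."""
    l_ct = sites * control_files
    l_ca = sites * case_files
    return [[x, l_ct - x], [y, l_ca - y]]


def chi_squared_2x2(table, correct=True):
    """Return (statistic, p_value), or None when a row or column total is zero."""
    (a, b), (c, d) = table
    n1, n2 = a + b, c + d
    if n1 == 0 or n2 == 0 or a + c == 0 or b + d == 0:
        return None  # degenerate table: no test is possible
    pooled = (a + c) / (n1 + n2)
    expected = [[n1 * pooled, n1 * (1 - pooled)],
                [n2 * pooled, n2 * (1 - pooled)]]
    # Continuity correction, capped so that it never exceeds the observed deviation.
    yates = min(0.5, abs(a - expected[0][0])) if correct else 0.0
    statistic = sum((abs(o - e) - yates) ** 2 / e
                    for obs_row, exp_row in zip(table, expected)
                    for o, e in zip(obs_row, exp_row))
    # Upper tail of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# 3 CONTROL files and 4 CASE files, N = 150 sites: L_CT = 450, L_CA = 600.
table = contingency_table(x=30, y=20, sites=150, control_files=3, case_files=4)
pr_ct = table[0][0] / sum(table[0])  # mutation frequency in CONTROL (x / L_CT)
pr_ca = table[1][0] / sum(table[1])  # mutation frequency in CASE (y / L_CA)
result = chi_squared_2x2(table)
print(table, pr_ct, pr_ca, result)
```

Returning `None` for a degenerate table is a simplification of this sketch; the conditions under which SHMTool reports NA are not specified here.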
MUT: type of substitution (sum: compiled mutations; Tv: transversions; Ts: transitions; GC: mutations at C or G sites; AT: mutations at A or T sites; ALL: overall mutations).
CT: number of accumulated mutations in the CONTROL category. All unique mutations from the different files that constitute the category are pooled together.
CA: number of accumulated mutations in the CASE category. All unique mutations from the different files that constitute the category are pooled together.
SITES: number of potential sites in the consensus sequence that can present a specific mutation.
NCT: number of CONTROL trials. We consider it correct to set NCT to the theoretical maximum number of mutations, i.e. the number of SITES multiplied by the number of groups (files) in the CONTROL category.
NCA: number of CASE trials. We consider it correct to set NCA to the theoretical maximum number of mutations, i.e. the number of SITES multiplied by the number of groups (files) in the CASE category.
PR_CT: mutation frequency in the CONTROL category. It includes correction for base composition, since it is calculated by dividing CT by NCT.
PR_CA: mutation frequency in the CASE category. It includes correction for base composition, since it is calculated by dividing CA by NCA.
P: p-value when the χ2 test [as implemented by the R function prop.test (Bates et al., 1996)] is applied to the 2x2 contingency tables described above.
NA: shown if the minimal conditions for reliability of the statistical test are not met, so that a P value cannot be given.
The user may also specify a subregion (potentially non-contiguous subset of sites) S of the consensus sequence to be analyzed separately. The complementary subregion S’ (all sites not in S) is also analyzed.
All the statistics for unique and non-unique mutations comparing S and S’ are also generated.
The subregion feature will typically be used, for example, to analyze the complementarity-determining regions (CDRs) that form the antigen-binding sites separately from the framework regions that position the CDRs in the variable region, or to exclude known polymorphic sites that are not mutations and would lead to an overestimate of the mutation frequency.
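How S and its complement S' relate can be shown with a small sketch. The input format used here (comma-separated 1-based positions and inclusive ranges such as "3-5, 9") is an assumption made for illustration and is not necessarily the format expected by the SHMTool input box; the function names are hypothetical.

```python
def parse_subregion(spec):
    """Parse a hypothetical spec such as "3-5, 9" into a set of 1-based sites."""
    sites = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(v) for v in part.split("-"))
            sites.update(range(start, end + 1))  # ranges are inclusive
        else:
            sites.add(int(part))
    return sites


def complement(sites, consensus_length):
    """Return S': every site of the consensus that is not in S."""
    return set(range(1, consensus_length + 1)) - sites


s_region = parse_subregion("3-5, 9")
s_prime = complement(s_region, consensus_length=10)
print(sorted(s_region), sorted(s_prime))  # [3, 4, 5, 9] [1, 2, 6, 7, 8, 10]
```

Because S is stored as a set of sites, it does not need to be contiguous, and S and S' together always cover the whole consensus.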
Remarkably, besides the basic analysis where S (or S', independently) is compared between CASE and CONTROL, SHMTool provides a detailed comparative analysis of S versus S' within each category.
Note that when performing subregion/non-subregion analysis for motifs, it is the position of the mutable base (e.g. the "C" in a WRC motif) that is considered within or outside the subregion. Thus, even if part of the motif lies outside the subregion, it will be counted as lying inside if the mutable base is within the subregion.
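The rule for motifs can be made concrete with a short sketch for WRC, where W is A or T, R is A or G, and the C is the mutable base. The function names, the toy consensus and the subregion below are hypothetical, and this is not SHMTool's own code.

```python
import re


def wrc_mutable_positions(consensus):
    """Return the 1-based position of the C in every WRC motif (W = A/T, R = A/G)."""
    # m.start() is the 0-based index of W; the C lies two bases further on.
    return [m.start() + 3 for m in re.finditer(r"[AT][AG]C", consensus)]


def split_by_subregion(positions, subregion):
    """Assign each motif to S or S' according to the position of its mutable base only."""
    inside = [p for p in positions if p in subregion]
    outside = [p for p in positions if p not in subregion]
    return inside, outside


consensus = "TTAGCTTTACGG"   # AGC at sites 3-5, TAC at sites 8-10
subregion = {5, 6, 7, 8}     # S covers sites 5-8
positions = wrc_mutable_positions(consensus)
inside, outside = split_by_subregion(positions, subregion)
print(positions, inside, outside)  # [5, 10] [5] [10]
```

The AGC motif is counted inside S because its C sits at site 5, even though its W and R lie outside; the TAC motif is counted outside S because its C sits at site 10, even though its W is at site 8.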
S and S' regions are compared within the CONTROL category.
S and S' regions are compared within the CASE category.
_SUB: refers to S region (subregion), which is defined by the user during the uploading process.
_NON: refers to S' region (non-subregion), which is automatically deduced by SHMTool as the complement of the S region.
Both unique and non-unique mutations are submitted to this kind of subregion analysis.
Microsoft Internet Explorer causes two inconveniences, both associated with use of the browser "Back" button.
We therefore recommend using either Mozilla Firefox or Safari web browsers.