How can I remove (non-trivial) duplicates from a VCF file?Are duplicate variants against the VCF standard?How...
Why is working on the same position for more than 15 years not a red flag?
Do authors have to be politically correct in article-writing?
What's a good word to describe a public place that looks like it wouldn't be rough?
Is it a fallacy if someone claims they need an explanation for every word of your argument to the point where they don't understand common terms?
How should I handle players who ignore the session zero agreement?
Line of Bones to Travel and Conform to Curve (Like Train on a Track, Snake...)
Is it possible to grant users sftp access without shell access? If yes, how is it implemented?
Why did Luke use his left hand to shoot?
Dilemma of explaining to interviewer that he is the reason for declining second interview
How to play electric guitar and bass as a duet
How to use Mathematica to do a complex integrate with poles in real axis?
How do you voice extended chords?
Is Krishna the only avatar among dashavatara who had more than one wife?
How does Leonard in "Memento" remember reading and writing?
Why avoid shared user accounts?
How do I draw the dashed lines as shown in this figure
In mixed effect models, how account for grouped random effects?
Move fast ...... Or you will lose
Increment each digit in a number to form a new number
Why exactly do action photographers need high fps burst cameras?
Citing paywalled articles accessed via illegal web sharing
What to look for when criticizing poetry?
Which communication protocol is used in AdLib sound card?
Why would space fleets be aligned?
How can I remove (non-trivial) duplicates from a VCF file?
Are duplicate variants against the VCF standard?How to read structural variant VCF?How do I carry out an ancestry/admixture test on a single VCF file?How can I extract only insertions from a VCF file?How to manipulate a reference FASTA or bam to include variants from a VCF?How to represent a deletion at position 1 in a VCF file?Where can I get the population allele frequency vcf file?How to subset samples from a VCF file?Meaning of the FORMAT fields of the VCF file coming from GIAB projectHigh Phred Quality score VCF fileAre duplicate variants against the VCF standard?
$begingroup$
This is related to the question I asked here. Consider a vcf file that contains duplicate variants, but where the duplicates aren't simply the same thing in the same notation but instead one is a subset of the other. For example:
##fileformat=VCFv4.1
##reference=foo
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr12>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
chr12 529514 . AACAC AATAC . PASS . GT 0/1
chr12 529516 . C T . PASS . GT 0/1
These the two variants are actually the same. They result in exactly the same genotype. Changing AACAC to AATAC at position 529514 just means change C to T at position 529516.
Is there any tool that can detect such duplicates and remove them? I tried vcfuniq
from vcflib
, but that doesn't seem to recognize this as a duplicate. I think it only looks at the 1st 4 fields and only considers duplicates those variants with exactly the same values in the 1st 4 fields:
$ ./bin/vcfuniq test.vcf
##fileformat=VCFv4.1
##reference=foo
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr12>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
chr12 529514 . AACAC AATAC . PASS . GT 0/1
chr12 529516 . C T . PASS . GT 0/1
However, as explained in the linked question, EBI's vcf_validator
considers this invalid. And it doesn't really make sense to have these duplicates in any case, so is there any way I can detect and remove them? Preferably an existing tool, but I am open to scripting solutions as well.
This is further complicated by cases like this one:
##fileformat=VCFv4.1
##reference=foo
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr12>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
chr12 529514 529514 AACAC AAT,AATAC 0.00 . . GT 0/1
chr12 529516 529516 C T . PASS . GT 0/1
Unfortunatley, this one won't be caught by the approach in Daniel's clever script:
$ cat test2.vcf | foo.py
##fileformat=VCFv4.1
##reference=foo
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr12>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
chr12 529514 529514 AACAC AAT,AATAC 0.00 . . GT 0/1
chr12 529516 529516 C T . PASS . GT 0/1
vcf variation
$endgroup$
add a comment |
$begingroup$
This is related to the question I asked here. Consider a vcf file that contains duplicate variants, but where the duplicates aren't simply the same thing in the same notation but instead one is a subset of the other. For example:
##fileformat=VCFv4.1
##reference=foo
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr12>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
chr12 529514 . AACAC AATAC . PASS . GT 0/1
chr12 529516 . C T . PASS . GT 0/1
These the two variants are actually the same. They result in exactly the same genotype. Changing AACAC to AATAC at position 529514 just means change C to T at position 529516.
Is there any tool that can detect such duplicates and remove them? I tried vcfuniq
from vcflib
, but that doesn't seem to recognize this as a duplicate. I think it only looks at the 1st 4 fields and only considers duplicates those variants with exactly the same values in the 1st 4 fields:
$ ./bin/vcfuniq test.vcf
##fileformat=VCFv4.1
##reference=foo
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr12>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
chr12 529514 . AACAC AATAC . PASS . GT 0/1
chr12 529516 . C T . PASS . GT 0/1
However, as explained in the linked question, EBI's vcf_validator
considers this invalid. And it doesn't really make sense to have these duplicates in any case, so is there any way I can detect and remove them? Preferably an existing tool, but I am open to scripting solutions as well.
This is further complicated by cases like this one:
##fileformat=VCFv4.1
##reference=foo
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr12>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
chr12 529514 529514 AACAC AAT,AATAC 0.00 . . GT 0/1
chr12 529516 529516 C T . PASS . GT 0/1
Unfortunatley, this one won't be caught by the approach in Daniel's clever script:
$ cat test2.vcf | foo.py
##fileformat=VCFv4.1
##reference=foo
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr12>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
chr12 529514 529514 AACAC AAT,AATAC 0.00 . . GT 0/1
chr12 529516 529516 C T . PASS . GT 0/1
vcf variation
$endgroup$
add a comment |
$begingroup$
This is related to the question I asked here. Consider a vcf file that contains duplicate variants, but where the duplicates aren't simply the same thing in the same notation but instead one is a subset of the other. For example:
##fileformat=VCFv4.1
##reference=foo
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr12>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
chr12 529514 . AACAC AATAC . PASS . GT 0/1
chr12 529516 . C T . PASS . GT 0/1
These the two variants are actually the same. They result in exactly the same genotype. Changing AACAC to AATAC at position 529514 just means change C to T at position 529516.
Is there any tool that can detect such duplicates and remove them? I tried vcfuniq
from vcflib
, but that doesn't seem to recognize this as a duplicate. I think it only looks at the 1st 4 fields and only considers duplicates those variants with exactly the same values in the 1st 4 fields:
$ ./bin/vcfuniq test.vcf
##fileformat=VCFv4.1
##reference=foo
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr12>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
chr12 529514 . AACAC AATAC . PASS . GT 0/1
chr12 529516 . C T . PASS . GT 0/1
However, as explained in the linked question, EBI's vcf_validator
considers this invalid. And it doesn't really make sense to have these duplicates in any case, so is there any way I can detect and remove them? Preferably an existing tool, but I am open to scripting solutions as well.
This is further complicated by cases like this one:
##fileformat=VCFv4.1
##reference=foo
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr12>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
chr12 529514 529514 AACAC AAT,AATAC 0.00 . . GT 0/1
chr12 529516 529516 C T . PASS . GT 0/1
Unfortunatley, this one won't be caught by the approach in Daniel's clever script:
$ cat test2.vcf | foo.py
##fileformat=VCFv4.1
##reference=foo
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr12>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
chr12 529514 529514 AACAC AAT,AATAC 0.00 . . GT 0/1
chr12 529516 529516 C T . PASS . GT 0/1
vcf variation
$endgroup$
This is related to the question I asked here. Consider a vcf file that contains duplicate variants, but where the duplicates aren't simply the same thing in the same notation but instead one is a subset of the other. For example:
##fileformat=VCFv4.1
##reference=foo
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr12>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
chr12 529514 . AACAC AATAC . PASS . GT 0/1
chr12 529516 . C T . PASS . GT 0/1
These the two variants are actually the same. They result in exactly the same genotype. Changing AACAC to AATAC at position 529514 just means change C to T at position 529516.
Is there any tool that can detect such duplicates and remove them? I tried vcfuniq
from vcflib
, but that doesn't seem to recognize this as a duplicate. I think it only looks at the 1st 4 fields and only considers duplicates those variants with exactly the same values in the 1st 4 fields:
$ ./bin/vcfuniq test.vcf
##fileformat=VCFv4.1
##reference=foo
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr12>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
chr12 529514 . AACAC AATAC . PASS . GT 0/1
chr12 529516 . C T . PASS . GT 0/1
However, as explained in the linked question, EBI's vcf_validator
considers this invalid. And it doesn't really make sense to have these duplicates in any case, so is there any way I can detect and remove them? Preferably an existing tool, but I am open to scripting solutions as well.
This is further complicated by cases like this one:
##fileformat=VCFv4.1
##reference=foo
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr12>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
chr12 529514 529514 AACAC AAT,AATAC 0.00 . . GT 0/1
chr12 529516 529516 C T . PASS . GT 0/1
Unfortunatley, this one won't be caught by the approach in Daniel's clever script:
$ cat test2.vcf | foo.py
##fileformat=VCFv4.1
##reference=foo
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr12>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
chr12 529514 529514 AACAC AAT,AATAC 0.00 . . GT 0/1
chr12 529516 529516 C T . PASS . GT 0/1
vcf variation
vcf variation
edited 17 mins ago
terdon
asked 2 hours ago
terdon♦terdon
4,3701729
4,3701729
add a comment |
add a comment |
2 Answers
2
active
oldest
votes
$begingroup$
It turns out that bcftools
can do this (tested on bcftools-1.8), if you give it the reference genome to test against:
$ bcftools norm -d none -f hg19.fa test.vcf
##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##reference=foo
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr12>
##bcftools_normVersion=1.8+htslib-1.8
##bcftools_normCommand=norm -d none -f hg19.fa test.vcf; Date=Wed Feb 27 16:08:44 2019
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
chr12 529516 . C T . PASS . GT 0/1
Lines total/split/realigned/skipped: 2/0/1/0
For the more complex case of the multi-allelic variant in the second VCF example from the question, you can run it through bcftools twice. Once using norm
to left-align and split multi-allelic variants, and then again t remove the duplicates:
$ bcftools norm -m -any -NO z -O v -o - ~/test2.vcf |
bcftools norm -d none -f hg19.fa
Lines total/split/realigned/skipped: 2/1/0/0
##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##reference=foo
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr12>
##bcftools_normVersion=1.8+htslib-1.8
##bcftools_normCommand=norm -m -any -NO z -O v -o - test2.vcf; Date=Wed Feb 27 18:18:32 2019
##bcftools_normCommand=norm -d none -f hg19.fa -; Date=Wed Feb 27 18:18:32 2019
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
chr12 529516 529514 CAC T 0 . . GT 0/1
chr12 529516 529514 C T 0 . . GT 0/0
Lines total/split/realigned/skipped: 3/0/2/0
$endgroup$
add a comment |
$begingroup$
I'm no expert with VCF (few can say they are!) but I have worked a lot with VCF data in the last few years, both tools to consume and produce VCF. I've never seen variants encoded in this fashion, and it seems to be non-canonical. Typically:
- Single nucleotide variants (SNVs) are encoded with a single base as the REF allele and a single base as the ALT allele.
- In the case of insertions or deletions, the shorter of the REF and ALT alleles will be a single base, the base preceding the inserted/deleted sequence. Thus the first base of the REF and ALT alleles is always the same.
- In the rarer case of two or more consecutive substitutions forming a multinucleotide variant (MNV) the REF and ALT alleles will have the same length.
Using multi-bp strings of the same length to encode SNVs is unnecessary and, as you've pointed out, problematic. This makes me think its a bug or a "feature" of the variant predictor that produced the VCF.
In this case, I'd write a small script that would check for variants where the REF and ALT alleles have the same length. If the base is the same for REF and ALT in any position, drop it, and adjust the position accordingly.
The script below will convert these funky SNVs to the canonical representation, and will also work on MNVs. Standard tools should then work to remove the duplicates.
#!/usr/bin/env python3
def canonicalize(instream):
for line in instream:
if not line.startswith('#'):
values = line.split('t')
pos = int(values[1])
ref, alt = values[3:5]
if len(ref) > 1 and len(ref) == len(alt):
# How many bp to trim off the end
for n, (r, a) in enumerate(zip(ref[::-1], alt[::-1])):
if r != a:
revoffset = -1 * n
break
# How many bp to trim off the front
for n, (r, a) in enumerate(zip(ref, alt)):
if r != a:
offset = n
values[1] = str(pos + offset)
values[3] = ref[offset:revoffset]
values[4] = alt[offset:revoffset]
break
line = 't'.join(values)
yield line
if __name__ == '__main__':
import sys
for line in canonicalize(sys.stdin):
print(line, end='')
$endgroup$
1
$begingroup$
It is indeed atypical, but this question was prompted because I actually encountered this in the wild. I didn't generate the vcf, it was sent to me, but the header suggests it was produced byfreebayes
and the merge of two separate files. So this was probably an artefact of the merging. Unfortunately, I need to deal with VCF files that are given to me by clients, so while I can insist that they conform to the standards, this does conform (AFAIK), so I needed a way of fixing it.
$endgroup$
– terdon♦
2 hours ago
$begingroup$
Nice one! Unfortunately (and sorry, I should have made this clear) the actual file I had contains multi-allelic variants where only one was a dupe (see updated question). Your approach won't catch those, but bcftools does (see my answer). This works great for singles though.
$endgroup$
– terdon♦
18 mins ago
$begingroup$
Dude, that is some gnarly VCF.
$endgroup$
– Daniel Standage
4 mins ago
$begingroup$
Welcome to my world :(
$endgroup$
– terdon♦
2 mins ago
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
return StackExchange.using("mathjaxEditing", function () {
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
});
});
}, "mathjax-editing");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "676"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fbioinformatics.stackexchange.com%2fquestions%2f7126%2fhow-can-i-remove-non-trivial-duplicates-from-a-vcf-file%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
$begingroup$
It turns out that bcftools
can do this (tested on bcftools-1.8), if you give it the reference genome to test against:
$ bcftools norm -d none -f hg19.fa test.vcf
##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##reference=foo
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr12>
##bcftools_normVersion=1.8+htslib-1.8
##bcftools_normCommand=norm -d none -f hg19.fa test.vcf; Date=Wed Feb 27 16:08:44 2019
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
chr12 529516 . C T . PASS . GT 0/1
Lines total/split/realigned/skipped: 2/0/1/0
For the more complex case of the multi-allelic variant in the second VCF example from the question, you can run it through bcftools twice. Once using norm
to left-align and split multi-allelic variants, and then again t remove the duplicates:
$ bcftools norm -m -any -NO z -O v -o - ~/test2.vcf |
bcftools norm -d none -f hg19.fa
Lines total/split/realigned/skipped: 2/1/0/0
##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##reference=foo
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr12>
##bcftools_normVersion=1.8+htslib-1.8
##bcftools_normCommand=norm -m -any -NO z -O v -o - test2.vcf; Date=Wed Feb 27 18:18:32 2019
##bcftools_normCommand=norm -d none -f hg19.fa -; Date=Wed Feb 27 18:18:32 2019
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
chr12 529516 529514 CAC T 0 . . GT 0/1
chr12 529516 529514 C T 0 . . GT 0/0
Lines total/split/realigned/skipped: 3/0/2/0
$endgroup$
add a comment |
$begingroup$
It turns out that bcftools
can do this (tested on bcftools-1.8), if you give it the reference genome to test against:
$ bcftools norm -d none -f hg19.fa test.vcf
##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##reference=foo
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr12>
##bcftools_normVersion=1.8+htslib-1.8
##bcftools_normCommand=norm -d none -f hg19.fa test.vcf; Date=Wed Feb 27 16:08:44 2019
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
chr12 529516 . C T . PASS . GT 0/1
Lines total/split/realigned/skipped: 2/0/1/0
For the more complex case of the multi-allelic variant in the second VCF example from the question, you can run it through bcftools twice. Once using norm
to left-align and split multi-allelic variants, and then again t remove the duplicates:
$ bcftools norm -m -any -NO z -O v -o - ~/test2.vcf |
bcftools norm -d none -f hg19.fa
Lines total/split/realigned/skipped: 2/1/0/0
##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##reference=foo
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr12>
##bcftools_normVersion=1.8+htslib-1.8
##bcftools_normCommand=norm -m -any -NO z -O v -o - test2.vcf; Date=Wed Feb 27 18:18:32 2019
##bcftools_normCommand=norm -d none -f hg19.fa -; Date=Wed Feb 27 18:18:32 2019
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
chr12 529516 529514 CAC T 0 . . GT 0/1
chr12 529516 529514 C T 0 . . GT 0/0
Lines total/split/realigned/skipped: 3/0/2/0
$endgroup$
add a comment |
$begingroup$
It turns out that bcftools
can do this (tested on bcftools-1.8), if you give it the reference genome to test against:
$ bcftools norm -d none -f hg19.fa test.vcf
##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##reference=foo
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr12>
##bcftools_normVersion=1.8+htslib-1.8
##bcftools_normCommand=norm -d none -f hg19.fa test.vcf; Date=Wed Feb 27 16:08:44 2019
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
chr12 529516 . C T . PASS . GT 0/1
Lines total/split/realigned/skipped: 2/0/1/0
For the more complex case of the multi-allelic variant in the second VCF example from the question, you can run it through bcftools twice. Once using norm
to left-align and split multi-allelic variants, and then again t remove the duplicates:
$ bcftools norm -m -any -NO z -O v -o - ~/test2.vcf |
bcftools norm -d none -f hg19.fa
Lines total/split/realigned/skipped: 2/1/0/0
##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##reference=foo
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr12>
##bcftools_normVersion=1.8+htslib-1.8
##bcftools_normCommand=norm -m -any -NO z -O v -o - test2.vcf; Date=Wed Feb 27 18:18:32 2019
##bcftools_normCommand=norm -d none -f hg19.fa -; Date=Wed Feb 27 18:18:32 2019
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
chr12 529516 529514 CAC T 0 . . GT 0/1
chr12 529516 529514 C T 0 . . GT 0/0
Lines total/split/realigned/skipped: 3/0/2/0
$endgroup$
It turns out that bcftools
can do this (tested on bcftools-1.8), if you give it the reference genome to test against:
$ bcftools norm -d none -f hg19.fa test.vcf
##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##reference=foo
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr12>
##bcftools_normVersion=1.8+htslib-1.8
##bcftools_normCommand=norm -d none -f hg19.fa test.vcf; Date=Wed Feb 27 16:08:44 2019
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
chr12 529516 . C T . PASS . GT 0/1
Lines total/split/realigned/skipped: 2/0/1/0
For the more complex case of the multi-allelic variant in the second VCF example from the question, you can run it through bcftools twice. Once using norm
to left-align and split multi-allelic variants, and then again t remove the duplicates:
$ bcftools norm -m -any -NO z -O v -o - ~/test2.vcf |
bcftools norm -d none -f hg19.fa
Lines total/split/realigned/skipped: 2/1/0/0
##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##reference=foo
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr12>
##bcftools_normVersion=1.8+htslib-1.8
##bcftools_normCommand=norm -m -any -NO z -O v -o - test2.vcf; Date=Wed Feb 27 18:18:32 2019
##bcftools_normCommand=norm -d none -f hg19.fa -; Date=Wed Feb 27 18:18:32 2019
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
chr12 529516 529514 CAC T 0 . . GT 0/1
chr12 529516 529514 C T 0 . . GT 0/0
Lines total/split/realigned/skipped: 3/0/2/0
edited 18 mins ago
answered 2 hours ago
terdon♦terdon
4,3701729
4,3701729
add a comment |
add a comment |
$begingroup$
I'm no expert with VCF (few can say they are!) but I have worked a lot with VCF data in the last few years, both tools to consume and produce VCF. I've never seen variants encoded in this fashion, and it seems to be non-canonical. Typically:
- Single nucleotide variants (SNVs) are encoded with a single base as the REF allele and a single base as the ALT allele.
- In the case of insertions or deletions, the shorter of the REF and ALT alleles will be a single base, the base preceding the inserted/deleted sequence. Thus the first base of the REF and ALT alleles is always the same.
- In the rarer case of two or more consecutive substitutions forming a multinucleotide variant (MNV) the REF and ALT alleles will have the same length.
Using multi-bp strings of the same length to encode SNVs is unnecessary and, as you've pointed out, problematic. This makes me think its a bug or a "feature" of the variant predictor that produced the VCF.
In this case, I'd write a small script that would check for variants where the REF and ALT alleles have the same length. If the base is the same for REF and ALT in any position, drop it, and adjust the position accordingly.
The script below will convert these funky SNVs to the canonical representation, and will also work on MNVs. Standard tools should then work to remove the duplicates.
#!/usr/bin/env python3
def canonicalize(instream):
for line in instream:
if not line.startswith('#'):
values = line.split('t')
pos = int(values[1])
ref, alt = values[3:5]
if len(ref) > 1 and len(ref) == len(alt):
# How many bp to trim off the end
for n, (r, a) in enumerate(zip(ref[::-1], alt[::-1])):
if r != a:
revoffset = -1 * n
break
# How many bp to trim off the front
for n, (r, a) in enumerate(zip(ref, alt)):
if r != a:
offset = n
values[1] = str(pos + offset)
values[3] = ref[offset:revoffset]
values[4] = alt[offset:revoffset]
break
line = 't'.join(values)
yield line
if __name__ == '__main__':
import sys
for line in canonicalize(sys.stdin):
print(line, end='')
$endgroup$
1
$begingroup$
It is indeed atypical, but this question was prompted because I actually encountered this in the wild. I didn't generate the vcf, it was sent to me, but the header suggests it was produced byfreebayes
and the merge of two separate files. So this was probably an artefact of the merging. Unfortunately, I need to deal with VCF files that are given to me by clients, so while I can insist that they conform to the standards, this does conform (AFAIK), so I needed a way of fixing it.
$endgroup$
– terdon♦
2 hours ago
$begingroup$
Nice one! Unfortunately (and sorry, I should have made this clear) the actual file I had contains multi-allelic variants where only one was a dupe (see updated question). Your approach won't catch those, but bcftools does (see my answer). This works great for singles though.
$endgroup$
– terdon♦
18 mins ago
$begingroup$
Dude, that is some gnarly VCF.
$endgroup$
– Daniel Standage
4 mins ago
$begingroup$
Welcome to my world :(
$endgroup$
– terdon♦
2 mins ago
add a comment |
$begingroup$
I'm no expert with VCF (few can say they are!) but I have worked a lot with VCF data in the last few years, both tools to consume and produce VCF. I've never seen variants encoded in this fashion, and it seems to be non-canonical. Typically:
- Single nucleotide variants (SNVs) are encoded with a single base as the REF allele and a single base as the ALT allele.
- In the case of insertions or deletions, the shorter of the REF and ALT alleles will be a single base, the base preceding the inserted/deleted sequence. Thus the first base of the REF and ALT alleles is always the same.
- In the rarer case of two or more consecutive substitutions forming a multinucleotide variant (MNV) the REF and ALT alleles will have the same length.
Using multi-bp strings of the same length to encode SNVs is unnecessary and, as you've pointed out, problematic. This makes me think its a bug or a "feature" of the variant predictor that produced the VCF.
In this case, I'd write a small script that would check for variants where the REF and ALT alleles have the same length. If the base is the same for REF and ALT in any position, drop it, and adjust the position accordingly.
The script below will convert these funky SNVs to the canonical representation, and will also work on MNVs. Standard tools should then work to remove the duplicates.
#!/usr/bin/env python3
def canonicalize(instream):
for line in instream:
if not line.startswith('#'):
values = line.split('t')
pos = int(values[1])
ref, alt = values[3:5]
if len(ref) > 1 and len(ref) == len(alt):
# How many bp to trim off the end
for n, (r, a) in enumerate(zip(ref[::-1], alt[::-1])):
if r != a:
revoffset = -1 * n
break
# How many bp to trim off the front
for n, (r, a) in enumerate(zip(ref, alt)):
if r != a:
offset = n
values[1] = str(pos + offset)
values[3] = ref[offset:revoffset]
values[4] = alt[offset:revoffset]
break
line = 't'.join(values)
yield line
if __name__ == '__main__':
import sys
for line in canonicalize(sys.stdin):
print(line, end='')
$endgroup$
1
$begingroup$
It is indeed atypical, but this question was prompted because I actually encountered this in the wild. I didn't generate the vcf, it was sent to me, but the header suggests it was produced byfreebayes
and the merge of two separate files. So this was probably an artefact of the merging. Unfortunately, I need to deal with VCF files that are given to me by clients, so while I can insist that they conform to the standards, this does conform (AFAIK), so I needed a way of fixing it.
$endgroup$
– terdon♦
2 hours ago
$begingroup$
Nice one! Unfortunately (and sorry, I should have made this clear) the actual file I had contains multi-allelic variants where only one was a dupe (see updated question). Your approach won't catch those, but bcftools does (see my answer). This works great for singles though.
$endgroup$
– terdon♦
18 mins ago
$begingroup$
Dude, that is some gnarly VCF.
$endgroup$
– Daniel Standage
4 mins ago
$begingroup$
Welcome to my world :(
$endgroup$
– terdon♦
2 mins ago
add a comment |
$begingroup$
I'm no expert with VCF (few can say they are!) but I have worked a lot with VCF data in the last few years, both tools to consume and produce VCF. I've never seen variants encoded in this fashion, and it seems to be non-canonical. Typically:
- Single nucleotide variants (SNVs) are encoded with a single base as the REF allele and a single base as the ALT allele.
- In the case of insertions or deletions, the shorter of the REF and ALT alleles will be a single base, the base preceding the inserted/deleted sequence. Thus the first base of the REF and ALT alleles is always the same.
- In the rarer case of two or more consecutive substitutions forming a multinucleotide variant (MNV) the REF and ALT alleles will have the same length.
Using multi-bp strings of the same length to encode SNVs is unnecessary and, as you've pointed out, problematic. This makes me think its a bug or a "feature" of the variant predictor that produced the VCF.
In this case, I'd write a small script that would check for variants where the REF and ALT alleles have the same length. If the base is the same for REF and ALT in any position, drop it, and adjust the position accordingly.
The script below will convert these funky SNVs to the canonical representation, and will also work on MNVs. Standard tools should then work to remove the duplicates.
#!/usr/bin/env python3
def canonicalize(instream):
for line in instream:
if not line.startswith('#'):
values = line.split('t')
pos = int(values[1])
ref, alt = values[3:5]
if len(ref) > 1 and len(ref) == len(alt):
# How many bp to trim off the end
for n, (r, a) in enumerate(zip(ref[::-1], alt[::-1])):
if r != a:
revoffset = -1 * n
break
# How many bp to trim off the front
for n, (r, a) in enumerate(zip(ref, alt)):
if r != a:
offset = n
values[1] = str(pos + offset)
values[3] = ref[offset:revoffset]
values[4] = alt[offset:revoffset]
break
line = 't'.join(values)
yield line
if __name__ == '__main__':
import sys
for line in canonicalize(sys.stdin):
print(line, end='')
$endgroup$
I'm no expert with VCF (few can say they are!) but I have worked a lot with VCF data in the last few years, both tools to consume and produce VCF. I've never seen variants encoded in this fashion, and it seems to be non-canonical. Typically:
- Single nucleotide variants (SNVs) are encoded with a single base as the REF allele and a single base as the ALT allele.
- In the case of insertions or deletions, the shorter of the REF and ALT alleles will be a single base, the base preceding the inserted/deleted sequence. Thus the first base of the REF and ALT alleles is always the same.
- In the rarer case of two or more consecutive substitutions forming a multinucleotide variant (MNV) the REF and ALT alleles will have the same length.
Using multi-bp strings of the same length to encode SNVs is unnecessary and, as you've pointed out, problematic. This makes me think its a bug or a "feature" of the variant predictor that produced the VCF.
In this case, I'd write a small script that would check for variants where the REF and ALT alleles have the same length. If the base is the same for REF and ALT in any position, drop it, and adjust the position accordingly.
The script below will convert these funky SNVs to the canonical representation, and will also work on MNVs. Standard tools should then work to remove the duplicates.
#!/usr/bin/env python3
def canonicalize(instream):
for line in instream:
if not line.startswith('#'):
values = line.split('t')
pos = int(values[1])
ref, alt = values[3:5]
if len(ref) > 1 and len(ref) == len(alt):
# How many bp to trim off the end
for n, (r, a) in enumerate(zip(ref[::-1], alt[::-1])):
if r != a:
revoffset = -1 * n
break
# How many bp to trim off the front
for n, (r, a) in enumerate(zip(ref, alt)):
if r != a:
offset = n
values[1] = str(pos + offset)
values[3] = ref[offset:revoffset]
values[4] = alt[offset:revoffset]
break
line = 't'.join(values)
yield line
if __name__ == '__main__':
import sys
for line in canonicalize(sys.stdin):
print(line, end='')
edited 1 hour ago
answered 2 hours ago
Daniel StandageDaniel Standage
2,333329
2,333329
1
$begingroup$
It is indeed atypical, but this question was prompted because I actually encountered this in the wild. I didn't generate the vcf, it was sent to me, but the header suggests it was produced byfreebayes
and the merge of two separate files. So this was probably an artefact of the merging. Unfortunately, I need to deal with VCF files that are given to me by clients, so while I can insist that they conform to the standards, this does conform (AFAIK), so I needed a way of fixing it.
$endgroup$
– terdon♦
2 hours ago
$begingroup$
Nice one! Unfortunately (and sorry, I should have made this clear) the actual file I had contains multi-allelic variants where only one was a dupe (see updated question). Your approach won't catch those, but bcftools does (see my answer). This works great for singles though.
$endgroup$
– terdon♦
18 mins ago
$begingroup$
Dude, that is some gnarly VCF.
$endgroup$
– Daniel Standage
4 mins ago
$begingroup$
Welcome to my world :(
$endgroup$
– terdon♦
2 mins ago
add a comment |
1
$begingroup$
It is indeed atypical, but this question was prompted because I actually encountered this in the wild. I didn't generate the vcf, it was sent to me, but the header suggests it was produced byfreebayes
and the merge of two separate files. So this was probably an artefact of the merging. Unfortunately, I need to deal with VCF files that are given to me by clients, so while I can insist that they conform to the standards, this does conform (AFAIK), so I needed a way of fixing it.
$endgroup$
– terdon♦
2 hours ago
$begingroup$
Nice one! Unfortunately (and sorry, I should have made this clear) the actual file I had contains multi-allelic variants where only one was a dupe (see updated question). Your approach won't catch those, but bcftools does (see my answer). This works great for singles though.
$endgroup$
– terdon♦
18 mins ago
$begingroup$
Dude, that is some gnarly VCF.
$endgroup$
– Daniel Standage
4 mins ago
$begingroup$
Welcome to my world :(
$endgroup$
– terdon♦
2 mins ago
1
1
$begingroup$
It is indeed atypical, but this question was prompted because I actually encountered this in the wild. I didn't generate the vcf, it was sent to me, but the header suggests it was produced by
freebayes
and the merge of two separate files. So this was probably an artefact of the merging. Unfortunately, I need to deal with VCF files that are given to me by clients, so while I can insist that they conform to the standards, this does conform (AFAIK), so I needed a way of fixing it.$endgroup$
– terdon♦
2 hours ago
$begingroup$
It is indeed atypical, but this question was prompted because I actually encountered this in the wild. I didn't generate the vcf, it was sent to me, but the header suggests it was produced by
freebayes
and the merge of two separate files. So this was probably an artefact of the merging. Unfortunately, I need to deal with VCF files that are given to me by clients, so while I can insist that they conform to the standards, this does conform (AFAIK), so I needed a way of fixing it.$endgroup$
– terdon♦
2 hours ago
$begingroup$
Nice one! Unfortunately (and sorry, I should have made this clear) the actual file I had contains multi-allelic variants where only one was a dupe (see updated question). Your approach won't catch those, but bcftools does (see my answer). This works great for singles though.
$endgroup$
– terdon♦
18 mins ago
$begingroup$
Nice one! Unfortunately (and sorry, I should have made this clear) the actual file I had contains multi-allelic variants where only one was a dupe (see updated question). Your approach won't catch those, but bcftools does (see my answer). This works great for singles though.
$endgroup$
– terdon♦
18 mins ago
$begingroup$
Dude, that is some gnarly VCF.
$endgroup$
– Daniel Standage
4 mins ago
$begingroup$
Dude, that is some gnarly VCF.
$endgroup$
– Daniel Standage
4 mins ago
$begingroup$
Welcome to my world :(
$endgroup$
– terdon♦
2 mins ago
$begingroup$
Welcome to my world :(
$endgroup$
– terdon♦
2 mins ago
add a comment |
Thanks for contributing an answer to Bioinformatics Stack Exchange!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
Use MathJax to format equations. MathJax reference.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fbioinformatics.stackexchange.com%2fquestions%2f7126%2fhow-can-i-remove-non-trivial-duplicates-from-a-vcf-file%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown