
How can I remove (non-trivial) duplicates from a VCF file?


This is related to the question I asked here ("Are duplicate variants against the VCF standard?"). Consider a VCF file that contains duplicate variants, where the duplicates are not simply the same record in the same notation; instead, one is a subset of the other. For example:

##fileformat=VCFv4.1

##reference=foo

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

##contig=<ID=chr12>

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  Sample1

chr12   529514  .   AACAC   AATAC   .   PASS    .   GT  0/1

chr12   529516  .   C   T   .   PASS    .   GT  0/1

These two variants are actually the same. They result in exactly the same genotype. Changing AACAC to AATAC at position 529514 just means changing C to T at position 529516.
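The equivalence can be checked by hand. The following is a minimal sketch, assuming the five reference bases at 529514 to 529518 are AACAC (which is what the REF column of the first record implies); `apply_variant` is a hypothetical helper, not part of any VCF library. It applies each record to that window and compares the resulting sequences.

```python
def apply_variant(window, window_start, pos, ref, alt):
    """Replace ref with alt at 1-based position pos inside a reference window."""
    i = pos - window_start
    if window[i:i + len(ref)] != ref:
        raise ValueError("REF does not match the reference window")
    return window[:i] + alt + window[i + len(ref):]


# Reference bases 529514..529518, as implied by the REF of the first record.
window = "AACAC"

first = apply_variant(window, 529514, 529514, "AACAC", "AATAC")
second = apply_variant(window, 529514, 529516, "C", "T")

print(first, second, first == second)  # AATAC AATAC True
```

Both records produce the same alternate sequence, which is why they describe a single variant.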

Is there any tool that can detect such duplicates and remove them? I tried vcfuniq from vcflib, but that doesn't seem to recognize this as a duplicate. I think it only compares the first 4 fields, and only treats variants as duplicates when those fields are exactly identical:

$ ./bin/vcfuniq test.vcf

##fileformat=VCFv4.1

##reference=foo

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

##contig=<ID=chr12>

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  Sample1

chr12   529514  .   AACAC   AATAC   .   PASS    .   GT  0/1

chr12   529516  .   C   T   .   PASS    .   GT  0/1

However, as explained in the linked question, EBI's vcf_validator considers this invalid. And it doesn't really make sense to have these duplicates in any case, so is there any way I can detect and remove them? Preferably an existing tool, but I am open to scripting solutions as well.

This is further complicated by cases like this one:

##fileformat=VCFv4.1

##reference=foo

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

##contig=<ID=chr12>

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  Sample1

chr12   529514  529514  AACAC   AAT,AATAC       0.00    .       .     GT    0/1

chr12   529516  529516  C       T       .       PASS    .     GT    0/1

Unfortunately, this one won't be caught by the approach in Daniel's clever script:

$ cat test2.vcf | foo.py

##fileformat=VCFv4.1

##reference=foo

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

##contig=<ID=chr12>

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  Sample1

chr12   529514  529514  AACAC   AAT,AATAC   0.00    .   .   GT  0/1

chr12   529516  529516  C   T   .   PASS    .   GT  0/1
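To see why the multi-allelic record hides a duplicate, it helps to split the ALT field and trim each allele separately. This is only a sketch under stated assumptions: `trim` and `allele_keys` are hypothetical helpers, the trimming removes shared flanking bases without consulting a reference (so it is not full left-alignment), and genotypes are ignored.

```python
def trim(pos, ref, alt):
    """Trim shared trailing, then shared leading bases, keeping one base per allele."""
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt, pos = ref[1:], alt[1:], pos + 1
    return pos, ref, alt


def allele_keys(chrom, pos, ref, alt_field):
    """Return one trimmed (chrom, pos, ref, alt) key per comma-separated ALT allele."""
    return {(chrom, *trim(pos, ref, alt)) for alt in alt_field.split(",")}


multi = allele_keys("chr12", 529514, "AACAC", "AAT,AATAC")
single = allele_keys("chr12", 529516, "C", "T")

print(sorted(multi))   # [('chr12', 529516, 'C', 'T'), ('chr12', 529516, 'CAC', 'T')]
print(multi & single)  # {('chr12', 529516, 'C', 'T')}
```

Only the second ALT allele of the multi-allelic record collides with the single-allele record, so the record as a whole cannot simply be dropped.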

edited 17 mins ago

asked 2 hours ago

terdon♦

vcf variation

2 Answers

It turns out that bcftools can do this (tested on bcftools-1.8), if you give it the reference genome to test against:

$ bcftools norm -d none -f hg19.fa test.vcf

##fileformat=VCFv4.1

##FILTER=<ID=PASS,Description="All filters passed">

##reference=foo

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

##contig=<ID=chr12>

##bcftools_normVersion=1.8+htslib-1.8

##bcftools_normCommand=norm -d none -f hg19.fa test.vcf; Date=Wed Feb 27 16:08:44 2019

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Sample1

chr12   529516  .       C       T       .       PASS    .       GT      0/1

Lines   total/split/realigned/skipped:  2/0/1/0

For the more complex case of the multi-allelic variant in the second VCF example from the question, you can run it through bcftools norm twice: once to split the multi-allelic record into one record per ALT allele (-m -any), and then again, with the reference, to realign the records and remove the duplicates:

$ bcftools norm -m -any -NO z -O v -o - ~/test2.vcf |

    bcftools norm -d none -f hg19.fa

Lines   total/split/realigned/skipped:  2/1/0/0

##fileformat=VCFv4.1

##FILTER=<ID=PASS,Description="All filters passed">

##reference=foo

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

##contig=<ID=chr12>

##bcftools_normVersion=1.8+htslib-1.8

##bcftools_normCommand=norm -m -any -NO z -O v -o - test2.vcf; Date=Wed Feb 27 18:18:32 2019

##bcftools_normCommand=norm -d none -f hg19.fa -; Date=Wed Feb 27 18:18:32 2019

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Sample1

chr12   529516  529514  CAC     T       0       .       .       GT      0/1

chr12   529516  529514  C       T       0       .       .       GT      0/0

Lines   total/split/realigned/skipped:  3/0/2/0
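When this is scripted, the summary line that bcftools norm prints can serve as a quick sanity check. The sketch below assumes the line has exactly the format shown in the output above (bcftools 1.8); other versions may print different fields, and `parse_norm_summary` is a hypothetical helper name.

```python
import re

# Matches e.g. "Lines   total/split/realigned/skipped:  2/0/1/0"
SUMMARY = re.compile(
    r"^Lines\s+total/split/realigned/skipped:\s+(\d+)/(\d+)/(\d+)/(\d+)\s*$"
)


def parse_norm_summary(line):
    """Return the four counts as a dict, or None if the line is not a summary line."""
    match = SUMMARY.match(line)
    if match is None:
        return None
    total, split, realigned, skipped = map(int, match.groups())
    return {"total": total, "split": split, "realigned": realigned, "skipped": skipped}


print(parse_norm_summary("Lines   total/split/realigned/skipped:  2/0/1/0"))
# {'total': 2, 'split': 0, 'realigned': 1, 'skipped': 0}
```

A non-zero "realigned" count, as in the runs above, indicates that at least one record was rewritten before duplicates were compared.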

edited 18 mins ago

answered 2 hours ago

terdon♦


I'm no expert with VCF (few can say they are!) but I have worked a lot with VCF data in the last few years, both tools to consume and produce VCF. I've never seen variants encoded in this fashion, and it seems to be non-canonical. Typically:

Single nucleotide variants (SNVs) are encoded with a single base as the REF allele and a single base as the ALT allele.

In the case of insertions or deletions, the shorter of the REF and ALT alleles will be a single base, the base preceding the inserted/deleted sequence. Thus the first base of the REF and ALT alleles is always the same.

In the rarer case of two or more consecutive substitutions forming a multinucleotide variant (MNV) the REF and ALT alleles will have the same length.
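These conventions can be turned into a rough check on the REF and ALT columns. The sketch below is an illustration only: the labels (in particular "padded SNV") are hypothetical names chosen here, and the function looks at a single ALT allele, so multi-allelic ALT fields would need to be split first.

```python
def classify(ref, alt):
    """Label a biallelic REF/ALT pair using the length conventions listed above."""
    if len(ref) != len(alt):
        return "indel"
    # Same length from here on: count the positions that actually differ.
    diffs = sum(r != a for r, a in zip(ref, alt))
    if diffs == 0:
        return "no change"
    if len(ref) == 1:
        return "SNV"
    # Multi-base alleles with a single differing base are SNVs padded with context.
    return "padded SNV" if diffs == 1 else "MNV"


for ref, alt in [("C", "T"), ("CAC", "T"), ("AC", "GT"), ("AACAC", "AATAC")]:
    print(ref, alt, classify(ref, alt))
```

The last pair is the record from the question: same-length alleles with a single differing base, which is the non-canonical encoding discussed here.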

Using multi-bp strings of the same length to encode SNVs is unnecessary and, as you've pointed out, problematic. This makes me think it's a bug or a "feature" of the variant predictor that produced the VCF.

In this case, I'd write a small script that would check for variants where the REF and ALT alleles have the same length. If the bases at the start or end of REF and ALT are identical, drop them and adjust the position accordingly.

The script below will convert these funky SNVs to the canonical representation, and will also work on MNVs. Standard tools should then work to remove the duplicates.

#!/usr/bin/env python3
"""Rewrite same-length REF/ALT records (padded SNVs and MNVs) in canonical form."""
import sys


def canonicalize(instream):
    for line in instream:
        if not line.startswith('#'):
            values = line.rstrip('\n').split('\t')
            pos = int(values[1])
            ref, alt = values[3:5]
            # Only touch same-length, multi-base alleles that actually differ
            if len(ref) > 1 and len(ref) == len(alt) and ref != alt:
                # Indices of every mismatching base; keep first through last
                diffs = [n for n, (r, a) in enumerate(zip(ref, alt)) if r != a]
                start, end = diffs[0], diffs[-1] + 1
                values[1] = str(pos + start)
                values[3] = ref[start:end]
                values[4] = alt[start:end]
            line = '\t'.join(values) + '\n'
        yield line


if __name__ == '__main__':
    for line in canonicalize(sys.stdin):
        print(line, end='')
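Once the records are in canonical form, exact duplicates can be dropped with a standard tool or with a few more lines of Python. The following is a minimal sketch, not part of the original script: `drop_exact_duplicates` is a hypothetical helper that keeps the first record for each (CHROM, POS, REF, ALT) key and silently discards later ones, including their genotypes.

```python
def drop_exact_duplicates(lines):
    """Yield VCF lines, skipping records whose CHROM, POS, REF and ALT were already seen."""
    seen = set()
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        key = (fields[0], fields[1], fields[3], fields[4])
        if key in seen:
            continue
        seen.add(key)
        yield line


records = [
    "##fileformat=VCFv4.1\n",
    "chr12\t529516\t.\tC\tT\t.\tPASS\t.\tGT\t0/1\n",
    "chr12\t529516\t.\tC\tT\t.\tPASS\t.\tGT\t0/1\n",
]
for line in drop_exact_duplicates(records):
    print(line, end="")
```

If the duplicate records carry different genotypes or INFO values, keeping only the first one may lose information, so that case deserves a manual check.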

edited 1 hour ago

answered 2 hours ago

Daniel Standage

It is indeed atypical, but this question was prompted because I actually encountered this in the wild. I didn't generate the vcf, it was sent to me, but the header suggests it was produced by freebayes and the merge of two separate files. So this was probably an artefact of the merging. Unfortunately, I need to deal with VCF files that are given to me by clients, so while I can insist that they conform to the standards, this does conform (AFAIK), so I needed a way of fixing it. – terdon♦, 2 hours ago

Nice one! Unfortunately (and sorry, I should have made this clear) the actual file I had contains multi-allelic variants where only one was a dupe (see updated question). Your approach won't catch those, but bcftools does (see my answer). This works great for singles though. – terdon♦, 18 mins ago

Dude, that is some gnarly VCF. – Daniel Standage, 4 mins ago

Welcome to my world :( – terdon♦, 2 mins ago

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
return StackExchange.using("mathjaxEditing", function () {
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\$","\$"]]);
});
});
}, "mathjax-editing");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "676"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fbioinformatics.stackexchange.com%2fquestions%2f7126%2fhow-can-i-remove-non-trivial-duplicates-from-a-vcf-file%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

2 Answers
2

active

oldest

votes

2 Answers
2

active

oldest

votes

It turns out that bcftools can do this (tested on bcftools-1.8), if you give it the reference genome to test against:

$ bcftools norm -d none -f hg19.fa test.vcf

##fileformat=VCFv4.1

##FILTER=<ID=PASS,Description="All filters passed">

##reference=foo

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

##contig=<ID=chr12>

##bcftools_normVersion=1.8+htslib-1.8

##bcftools_normCommand=norm -d none -f hg19.fa test.vcf; Date=Wed Feb 27 16:08:44 2019

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Sample1

chr12   529516  .       C       T       .       PASS    .       GT      0/1

Lines   total/split/realigned/skipped:  2/0/1/0

$ bcftools norm -m -any -NO z -O v -o - ~/test2.vcf |

    bcftools norm -d none -f hg19.fa

Lines   total/split/realigned/skipped:  2/1/0/0

##fileformat=VCFv4.1

##FILTER=<ID=PASS,Description="All filters passed">

##reference=foo

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

##contig=<ID=chr12>

##bcftools_normVersion=1.8+htslib-1.8

##bcftools_normCommand=norm -m -any -NO z -O v -o - test2.vcf; Date=Wed Feb 27 18:18:32 2019

##bcftools_normCommand=norm -d none -f hg19.fa -; Date=Wed Feb 27 18:18:32 2019

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Sample1

chr12   529516  529514  CAC     T       0       .       .       GT      0/1

chr12   529516  529514  C       T       0       .       .       GT      0/0

Lines   total/split/realigned/skipped:  3/0/2/0

edited 18 mins ago

answered 2 hours ago

terdon♦

4,3701729

add a comment |

It turns out that bcftools can do this (tested on bcftools-1.8), if you give it the reference genome to test against:

$ bcftools norm -d none -f hg19.fa test.vcf

##fileformat=VCFv4.1

##FILTER=<ID=PASS,Description="All filters passed">

##reference=foo

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

##contig=<ID=chr12>

##bcftools_normVersion=1.8+htslib-1.8

##bcftools_normCommand=norm -d none -f hg19.fa test.vcf; Date=Wed Feb 27 16:08:44 2019

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Sample1

chr12   529516  .       C       T       .       PASS    .       GT      0/1

Lines   total/split/realigned/skipped:  2/0/1/0

$ bcftools norm -m -any -NO z -O v -o - ~/test2.vcf |

    bcftools norm -d none -f hg19.fa

Lines   total/split/realigned/skipped:  2/1/0/0

##fileformat=VCFv4.1

##FILTER=<ID=PASS,Description="All filters passed">

##reference=foo

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

##contig=<ID=chr12>

##bcftools_normVersion=1.8+htslib-1.8

##bcftools_normCommand=norm -m -any -NO z -O v -o - test2.vcf; Date=Wed Feb 27 18:18:32 2019

##bcftools_normCommand=norm -d none -f hg19.fa -; Date=Wed Feb 27 18:18:32 2019

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Sample1

chr12   529516  529514  CAC     T       0       .       .       GT      0/1

chr12   529516  529514  C       T       0       .       .       GT      0/0

Lines   total/split/realigned/skipped:  3/0/2/0

edited 18 mins ago

answered 2 hours ago

terdon♦

4,3701729

add a comment |

It turns out that bcftools can do this (tested on bcftools-1.8), if you give it the reference genome to test against:

$ bcftools norm -d none -f hg19.fa test.vcf

##fileformat=VCFv4.1

##FILTER=<ID=PASS,Description="All filters passed">

##reference=foo

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

##contig=<ID=chr12>

##bcftools_normVersion=1.8+htslib-1.8

##bcftools_normCommand=norm -d none -f hg19.fa test.vcf; Date=Wed Feb 27 16:08:44 2019

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Sample1

chr12   529516  .       C       T       .       PASS    .       GT      0/1

Lines   total/split/realigned/skipped:  2/0/1/0

$ bcftools norm -m -any -NO z -O v -o - ~/test2.vcf |

    bcftools norm -d none -f hg19.fa

Lines   total/split/realigned/skipped:  2/1/0/0

##fileformat=VCFv4.1

##FILTER=<ID=PASS,Description="All filters passed">

##reference=foo

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

##contig=<ID=chr12>

##bcftools_normVersion=1.8+htslib-1.8

##bcftools_normCommand=norm -m -any -NO z -O v -o - test2.vcf; Date=Wed Feb 27 18:18:32 2019

##bcftools_normCommand=norm -d none -f hg19.fa -; Date=Wed Feb 27 18:18:32 2019

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Sample1

chr12   529516  529514  CAC     T       0       .       .       GT      0/1

chr12   529516  529514  C       T       0       .       .       GT      0/0

Lines   total/split/realigned/skipped:  3/0/2/0

edited 18 mins ago

answered 2 hours ago

terdon♦

4,3701729

It turns out that bcftools can do this (tested on bcftools-1.8), if you give it the reference genome to test against:

$ bcftools norm -d none -f hg19.fa test.vcf

##fileformat=VCFv4.1

##FILTER=<ID=PASS,Description="All filters passed">

##reference=foo

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

##contig=<ID=chr12>

##bcftools_normVersion=1.8+htslib-1.8

##bcftools_normCommand=norm -d none -f hg19.fa test.vcf; Date=Wed Feb 27 16:08:44 2019

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Sample1

chr12   529516  .       C       T       .       PASS    .       GT      0/1

Lines   total/split/realigned/skipped:  2/0/1/0

$ bcftools norm -m -any -NO z -O v -o - ~/test2.vcf |

    bcftools norm -d none -f hg19.fa

Lines   total/split/realigned/skipped:  2/1/0/0

##fileformat=VCFv4.1

##FILTER=<ID=PASS,Description="All filters passed">

##reference=foo

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

##contig=<ID=chr12>

##bcftools_normVersion=1.8+htslib-1.8

##bcftools_normCommand=norm -m -any -NO z -O v -o - test2.vcf; Date=Wed Feb 27 18:18:32 2019

##bcftools_normCommand=norm -d none -f hg19.fa -; Date=Wed Feb 27 18:18:32 2019

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Sample1

chr12   529516  529514  CAC     T       0       .       .       GT      0/1

chr12   529516  529514  C       T       0       .       .       GT      0/0

Lines   total/split/realigned/skipped:  3/0/2/0

edited 18 mins ago

answered 2 hours ago

terdon♦

4,3701729

edited 18 mins ago

answered 2 hours ago

terdon♦

4,3701729

answered 2 hours ago

terdon♦

4,3701729

answered 2 hours ago

terdon♦

4,3701729

add a comment |

Single nucleotide variants (SNVs) are encoded with a single base as the REF allele and a single base as the ALT allele.

In the case of insertions or deletions, the shorter of the REF and ALT alleles will be a single base, the base preceding the inserted/deleted sequence. Thus the first base of the REF and ALT alleles is always the same.

In the rarer case of two or more consecutive substitutions forming a multinucleotide variant (MNV) the REF and ALT alleles will have the same length.

The script below will convert these funky SNVs to the canonical representation, and will also work on MNVs. Standard tools should then work to remove the duplicates.

#!/usr/bin/env python3





def canonicalize(instream):

    for line in instream:

        if not line.startswith('#'):

            values = line.split('t')

            pos = int(values[1])

            ref, alt = values[3:5]

            if len(ref) > 1 and len(ref) == len(alt):

                # How many bp to trim off the end

                for n, (r, a) in enumerate(zip(ref[::-1], alt[::-1])):

                    if r != a:

                        revoffset = -1 * n

                        break



                # How many bp to trim off the front

                for n, (r, a) in enumerate(zip(ref, alt)):

                    if r != a:

                        offset = n

                        values[1] = str(pos + offset)

                        values[3] = ref[offset:revoffset]

                        values[4] = alt[offset:revoffset]

                        break

                line = 't'.join(values)

        yield line





if __name__ == '__main__':

    import sys

    for line in canonicalize(sys.stdin):

        print(line, end='')

edited 1 hour ago

answered 2 hours ago

Daniel Standage

2,333329

1

$begingroup$
It is indeed atypical, but this question was prompted because I actually encountered this in the wild. I didn't generate the vcf, it was sent to me, but the header suggests it was produced by freebayes and the merge of two separate files. So this was probably an artefact of the merging. Unfortunately, I need to deal with VCF files that are given to me by clients, so while I can insist that they conform to the standards, this does conform (AFAIK), so I needed a way of fixing it.
$endgroup$
– terdon♦
2 hours ago

$begingroup$
Nice one! Unfortunately (and sorry, I should have made this clear) the actual file I had contains multi-allelic variants where only one was a dupe (see updated question). Your approach won't catch those, but bcftools does (see my answer). This works great for singles though.
$endgroup$
– terdon♦
18 mins ago

$begingroup$
Dude, that is some gnarly VCF.
$endgroup$
– Daniel Standage
4 mins ago

$begingroup$
Welcome to my world :(
$endgroup$
– terdon♦
2 mins ago

add a comment |

Single nucleotide variants (SNVs) are encoded with a single base as the REF allele and a single base as the ALT allele.

In the case of insertions or deletions, the shorter of the REF and ALT alleles will be a single base, the base preceding the inserted/deleted sequence. Thus the first base of the REF and ALT alleles is always the same.

In the rarer case of two or more consecutive substitutions forming a multinucleotide variant (MNV) the REF and ALT alleles will have the same length.

The script below will convert these funky SNVs to the canonical representation, and will also work on MNVs. Standard tools should then work to remove the duplicates.

#!/usr/bin/env python3





def canonicalize(instream):

    for line in instream:

        if not line.startswith('#'):

            values = line.split('t')

            pos = int(values[1])

            ref, alt = values[3:5]

            if len(ref) > 1 and len(ref) == len(alt):

                # How many bp to trim off the end

                for n, (r, a) in enumerate(zip(ref[::-1], alt[::-1])):

                    if r != a:

                        revoffset = -1 * n

                        break



                # How many bp to trim off the front

                for n, (r, a) in enumerate(zip(ref, alt)):

                    if r != a:

                        offset = n

                        values[1] = str(pos + offset)

                        values[3] = ref[offset:revoffset]

                        values[4] = alt[offset:revoffset]

                        break

                line = 't'.join(values)

        yield line





if __name__ == '__main__':

    import sys

    for line in canonicalize(sys.stdin):

        print(line, end='')

edited 1 hour ago

answered 2 hours ago

Daniel Standage

2,333329

1

$begingroup$
It is indeed atypical, but this question was prompted because I actually encountered this in the wild. I didn't generate the vcf, it was sent to me, but the header suggests it was produced by freebayes and the merge of two separate files. So this was probably an artefact of the merging. Unfortunately, I need to deal with VCF files that are given to me by clients, so while I can insist that they conform to the standards, this does conform (AFAIK), so I needed a way of fixing it.
$endgroup$
– terdon♦
2 hours ago

$begingroup$
Nice one! Unfortunately (and sorry, I should have made this clear) the actual file I had contains multi-allelic variants where only one was a dupe (see updated question). Your approach won't catch those, but bcftools does (see my answer). This works great for singles though.
$endgroup$
– terdon♦
18 mins ago

$begingroup$
Dude, that is some gnarly VCF.
$endgroup$
– Daniel Standage
4 mins ago

$begingroup$
Welcome to my world :(
$endgroup$
– terdon♦
2 mins ago

add a comment |

Single nucleotide variants (SNVs) are encoded with a single base as the REF allele and a single base as the ALT allele.

In the case of insertions or deletions, the shorter of the REF and ALT alleles will be a single base, the base preceding the inserted/deleted sequence. Thus the first base of the REF and ALT alleles is always the same.

In the rarer case of two or more consecutive substitutions forming a multinucleotide variant (MNV) the REF and ALT alleles will have the same length.

The script below will convert these funky SNVs to the canonical representation, and will also work on MNVs. Standard tools should then work to remove the duplicates.

#!/usr/bin/env python3





def canonicalize(instream):

    for line in instream:

        if not line.startswith('#'):

            values = line.split('t')

            pos = int(values[1])

            ref, alt = values[3:5]

            if len(ref) > 1 and len(ref) == len(alt):

                # How many bp to trim off the end

                for n, (r, a) in enumerate(zip(ref[::-1], alt[::-1])):

                    if r != a:

                        revoffset = -1 * n

                        break



                # How many bp to trim off the front

                for n, (r, a) in enumerate(zip(ref, alt)):

                    if r != a:

                        offset = n

                        values[1] = str(pos + offset)

                        values[3] = ref[offset:revoffset]

                        values[4] = alt[offset:revoffset]

                        break

                line = 't'.join(values)

        yield line





if __name__ == '__main__':

    import sys

    for line in canonicalize(sys.stdin):

        print(line, end='')

edited 1 hour ago

answered 2 hours ago

Daniel Standage

2,333329

It is indeed atypical, but this question was prompted because I actually encountered this in the wild. I didn't generate the vcf, it was sent to me, but the header suggests it was produced by freebayes and the merge of two separate files. So this was probably an artefact of the merging. Unfortunately, I need to deal with VCF files that are given to me by clients, so while I can insist that they conform to the standards, this does conform (AFAIK), so I needed a way of fixing it.

– terdon♦
2 hours ago

Nice one! Unfortunately (and sorry, I should have made this clear) the actual file I had contains multi-allelic variants where only one was a dupe (see updated question). Your approach won't catch those, but bcftools does (see my answer). This works great for singles though.

– terdon♦
18 mins ago

Dude, that is some gnarly VCF.

– Daniel Standage
4 mins ago

Welcome to my world :(

– terdon♦
2 mins ago
