How can I remove (non-trivial) duplicates from a VCF file?Are duplicate variants against the VCF standard?How...

Why is working on the same position for more than 15 years not a red flag?

Do authors have to be politically correct in article-writing?

What's a good word to describe a public place that looks like it wouldn't be rough?

Is it a fallacy if someone claims they need an explanation for every word of your argument to the point where they don't understand common terms?

How should I handle players who ignore the session zero agreement?

Line of Bones to Travel and Conform to Curve (Like Train on a Track, Snake...)

Is it possible to grant users sftp access without shell access? If yes, how is it implemented?

Why did Luke use his left hand to shoot?

Dilemma of explaining to interviewer that he is the reason for declining second interview

How to play electric guitar and bass as a duet

How to use Mathematica to do a complex integrate with poles in real axis?

How do you voice extended chords?

Is Krishna the only avatar among dashavatara who had more than one wife?

How does Leonard in "Memento" remember reading and writing?

Why avoid shared user accounts?

How do I draw the dashed lines as shown in this figure

In mixed effect models, how account for grouped random effects?

Move fast ...... Or you will lose

Increment each digit in a number to form a new number

Why exactly do action photographers need high fps burst cameras?

Citing paywalled articles accessed via illegal web sharing

What to look for when criticizing poetry?

Which communication protocol is used in AdLib sound card?

Why would space fleets be aligned?



How can I remove (non-trivial) duplicates from a VCF file?


Are duplicate variants against the VCF standard?How to read structural variant VCF?How do I carry out an ancestry/admixture test on a single VCF file?How can I extract only insertions from a VCF file?How to manipulate a reference FASTA or bam to include variants from a VCF?How to represent a deletion at position 1 in a VCF file?Where can I get the population allele frequency vcf file?How to subset samples from a VCF file?Meaning of the FORMAT fields of the VCF file coming from GIAB projectHigh Phred Quality score VCF fileAre duplicate variants against the VCF standard?













3












$begingroup$


This is related to the question I asked here. Consider a vcf file that contains duplicate variants, but where the duplicates aren't simply the same thing in the same notation but instead one is a subset of the other. For example:



##fileformat=VCFv4.1
##reference=foo
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr12>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
chr12 529514 . AACAC AATAC . PASS . GT 0/1
chr12 529516 . C T . PASS . GT 0/1


These the two variants are actually the same. They result in exactly the same genotype. Changing AACAC to AATAC at position 529514 just means change C to T at position 529516.



Is there any tool that can detect such duplicates and remove them? I tried vcfuniq from vcflib, but that doesn't seem to recognize this as a duplicate. I think it only looks at the 1st 4 fields and only considers duplicates those variants with exactly the same values in the 1st 4 fields:



$ ./bin/vcfuniq test.vcf
##fileformat=VCFv4.1
##reference=foo
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr12>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
chr12 529514 . AACAC AATAC . PASS . GT 0/1
chr12 529516 . C T . PASS . GT 0/1


However, as explained in the linked question, EBI's vcf_validator considers this invalid. And it doesn't really make sense to have these duplicates in any case, so is there any way I can detect and remove them? Preferably an existing tool, but I am open to scripting solutions as well.





This is further complicated by cases like this one:



##fileformat=VCFv4.1
##reference=foo
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr12>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
chr12 529514 529514 AACAC AAT,AATAC 0.00 . . GT 0/1
chr12 529516 529516 C T . PASS . GT 0/1


Unfortunatley, this one won't be caught by the approach in Daniel's clever script:



$ cat test2.vcf | foo.py
##fileformat=VCFv4.1
##reference=foo
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr12>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
chr12 529514 529514 AACAC AAT,AATAC 0.00 . . GT 0/1
chr12 529516 529516 C T . PASS . GT 0/1









share|improve this question











$endgroup$

















    3












    $begingroup$


    This is related to the question I asked here. Consider a vcf file that contains duplicate variants, but where the duplicates aren't simply the same thing in the same notation but instead one is a subset of the other. For example:



    ##fileformat=VCFv4.1
    ##reference=foo
    ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
    ##contig=<ID=chr12>
    #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
    chr12 529514 . AACAC AATAC . PASS . GT 0/1
    chr12 529516 . C T . PASS . GT 0/1


    These the two variants are actually the same. They result in exactly the same genotype. Changing AACAC to AATAC at position 529514 just means change C to T at position 529516.



    Is there any tool that can detect such duplicates and remove them? I tried vcfuniq from vcflib, but that doesn't seem to recognize this as a duplicate. I think it only looks at the 1st 4 fields and only considers duplicates those variants with exactly the same values in the 1st 4 fields:



    $ ./bin/vcfuniq test.vcf
    ##fileformat=VCFv4.1
    ##reference=foo
    ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
    ##contig=<ID=chr12>
    #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
    chr12 529514 . AACAC AATAC . PASS . GT 0/1
    chr12 529516 . C T . PASS . GT 0/1


    However, as explained in the linked question, EBI's vcf_validator considers this invalid. And it doesn't really make sense to have these duplicates in any case, so is there any way I can detect and remove them? Preferably an existing tool, but I am open to scripting solutions as well.





    This is further complicated by cases like this one:



    ##fileformat=VCFv4.1
    ##reference=foo
    ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
    ##contig=<ID=chr12>
    #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
    chr12 529514 529514 AACAC AAT,AATAC 0.00 . . GT 0/1
    chr12 529516 529516 C T . PASS . GT 0/1


    Unfortunatley, this one won't be caught by the approach in Daniel's clever script:



    $ cat test2.vcf | foo.py
    ##fileformat=VCFv4.1
    ##reference=foo
    ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
    ##contig=<ID=chr12>
    #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
    chr12 529514 529514 AACAC AAT,AATAC 0.00 . . GT 0/1
    chr12 529516 529516 C T . PASS . GT 0/1









    share|improve this question











    $endgroup$















      3












      3








      3





      $begingroup$


      This is related to the question I asked here. Consider a vcf file that contains duplicate variants, but where the duplicates aren't simply the same thing in the same notation but instead one is a subset of the other. For example:



      ##fileformat=VCFv4.1
      ##reference=foo
      ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
      ##contig=<ID=chr12>
      #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
      chr12 529514 . AACAC AATAC . PASS . GT 0/1
      chr12 529516 . C T . PASS . GT 0/1


      These the two variants are actually the same. They result in exactly the same genotype. Changing AACAC to AATAC at position 529514 just means change C to T at position 529516.



      Is there any tool that can detect such duplicates and remove them? I tried vcfuniq from vcflib, but that doesn't seem to recognize this as a duplicate. I think it only looks at the 1st 4 fields and only considers duplicates those variants with exactly the same values in the 1st 4 fields:



      $ ./bin/vcfuniq test.vcf
      ##fileformat=VCFv4.1
      ##reference=foo
      ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
      ##contig=<ID=chr12>
      #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
      chr12 529514 . AACAC AATAC . PASS . GT 0/1
      chr12 529516 . C T . PASS . GT 0/1


      However, as explained in the linked question, EBI's vcf_validator considers this invalid. And it doesn't really make sense to have these duplicates in any case, so is there any way I can detect and remove them? Preferably an existing tool, but I am open to scripting solutions as well.





      This is further complicated by cases like this one:



      ##fileformat=VCFv4.1
      ##reference=foo
      ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
      ##contig=<ID=chr12>
      #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
      chr12 529514 529514 AACAC AAT,AATAC 0.00 . . GT 0/1
      chr12 529516 529516 C T . PASS . GT 0/1


      Unfortunatley, this one won't be caught by the approach in Daniel's clever script:



      $ cat test2.vcf | foo.py
      ##fileformat=VCFv4.1
      ##reference=foo
      ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
      ##contig=<ID=chr12>
      #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
      chr12 529514 529514 AACAC AAT,AATAC 0.00 . . GT 0/1
      chr12 529516 529516 C T . PASS . GT 0/1









      share|improve this question











      $endgroup$




      This is related to the question I asked here. Consider a vcf file that contains duplicate variants, but where the duplicates aren't simply the same thing in the same notation but instead one is a subset of the other. For example:



      ##fileformat=VCFv4.1
      ##reference=foo
      ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
      ##contig=<ID=chr12>
      #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
      chr12 529514 . AACAC AATAC . PASS . GT 0/1
      chr12 529516 . C T . PASS . GT 0/1


      These the two variants are actually the same. They result in exactly the same genotype. Changing AACAC to AATAC at position 529514 just means change C to T at position 529516.



      Is there any tool that can detect such duplicates and remove them? I tried vcfuniq from vcflib, but that doesn't seem to recognize this as a duplicate. I think it only looks at the 1st 4 fields and only considers duplicates those variants with exactly the same values in the 1st 4 fields:



      $ ./bin/vcfuniq test.vcf
      ##fileformat=VCFv4.1
      ##reference=foo
      ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
      ##contig=<ID=chr12>
      #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
      chr12 529514 . AACAC AATAC . PASS . GT 0/1
      chr12 529516 . C T . PASS . GT 0/1


      However, as explained in the linked question, EBI's vcf_validator considers this invalid. And it doesn't really make sense to have these duplicates in any case, so is there any way I can detect and remove them? Preferably an existing tool, but I am open to scripting solutions as well.





      This is further complicated by cases like this one:



      ##fileformat=VCFv4.1
      ##reference=foo
      ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
      ##contig=<ID=chr12>
      #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
      chr12 529514 529514 AACAC AAT,AATAC 0.00 . . GT 0/1
      chr12 529516 529516 C T . PASS . GT 0/1


      Unfortunatley, this one won't be caught by the approach in Daniel's clever script:



      $ cat test2.vcf | foo.py
      ##fileformat=VCFv4.1
      ##reference=foo
      ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
      ##contig=<ID=chr12>
      #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
      chr12 529514 529514 AACAC AAT,AATAC 0.00 . . GT 0/1
      chr12 529516 529516 C T . PASS . GT 0/1






      vcf variation






      share|improve this question















      share|improve this question













      share|improve this question




      share|improve this question








      edited 17 mins ago







      terdon

















      asked 2 hours ago









      terdonterdon

      4,3701729




      4,3701729






















          2 Answers
          2






          active

          oldest

          votes


















          2












          $begingroup$

          It turns out that bcftools can do this (tested on bcftools-1.8), if you give it the reference genome to test against:



          $ bcftools norm -d none -f hg19.fa test.vcf
          ##fileformat=VCFv4.1
          ##FILTER=<ID=PASS,Description="All filters passed">
          ##reference=foo
          ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
          ##contig=<ID=chr12>
          ##bcftools_normVersion=1.8+htslib-1.8
          ##bcftools_normCommand=norm -d none -f hg19.fa test.vcf; Date=Wed Feb 27 16:08:44 2019
          #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
          chr12 529516 . C T . PASS . GT 0/1
          Lines total/split/realigned/skipped: 2/0/1/0


          For the more complex case of the multi-allelic variant in the second VCF example from the question, you can run it through bcftools twice. Once using norm to left-align and split multi-allelic variants, and then again t remove the duplicates:



          $ bcftools norm -m -any -NO z -O v -o - ~/test2.vcf |
          bcftools norm -d none -f hg19.fa
          Lines total/split/realigned/skipped: 2/1/0/0
          ##fileformat=VCFv4.1
          ##FILTER=<ID=PASS,Description="All filters passed">
          ##reference=foo
          ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
          ##contig=<ID=chr12>
          ##bcftools_normVersion=1.8+htslib-1.8
          ##bcftools_normCommand=norm -m -any -NO z -O v -o - test2.vcf; Date=Wed Feb 27 18:18:32 2019
          ##bcftools_normCommand=norm -d none -f hg19.fa -; Date=Wed Feb 27 18:18:32 2019
          #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
          chr12 529516 529514 CAC T 0 . . GT 0/1
          chr12 529516 529514 C T 0 . . GT 0/0
          Lines total/split/realigned/skipped: 3/0/2/0





          share|improve this answer











          $endgroup$





















            1












            $begingroup$

            I'm no expert with VCF (few can say they are!) but I have worked a lot with VCF data in the last few years, both tools to consume and produce VCF. I've never seen variants encoded in this fashion, and it seems to be non-canonical. Typically:




            • Single nucleotide variants (SNVs) are encoded with a single base as the REF allele and a single base as the ALT allele.

            • In the case of insertions or deletions, the shorter of the REF and ALT alleles will be a single base, the base preceding the inserted/deleted sequence. Thus the first base of the REF and ALT alleles is always the same.

            • In the rarer case of two or more consecutive substitutions forming a multinucleotide variant (MNV) the REF and ALT alleles will have the same length.


            Using multi-bp strings of the same length to encode SNVs is unnecessary and, as you've pointed out, problematic. This makes me think its a bug or a "feature" of the variant predictor that produced the VCF.



            In this case, I'd write a small script that would check for variants where the REF and ALT alleles have the same length. If the base is the same for REF and ALT in any position, drop it, and adjust the position accordingly.



            The script below will convert these funky SNVs to the canonical representation, and will also work on MNVs. Standard tools should then work to remove the duplicates.



            #!/usr/bin/env python3


            def canonicalize(instream):
            for line in instream:
            if not line.startswith('#'):
            values = line.split('t')
            pos = int(values[1])
            ref, alt = values[3:5]
            if len(ref) > 1 and len(ref) == len(alt):
            # How many bp to trim off the end
            for n, (r, a) in enumerate(zip(ref[::-1], alt[::-1])):
            if r != a:
            revoffset = -1 * n
            break

            # How many bp to trim off the front
            for n, (r, a) in enumerate(zip(ref, alt)):
            if r != a:
            offset = n
            values[1] = str(pos + offset)
            values[3] = ref[offset:revoffset]
            values[4] = alt[offset:revoffset]
            break
            line = 't'.join(values)
            yield line


            if __name__ == '__main__':
            import sys
            for line in canonicalize(sys.stdin):
            print(line, end='')





            share|improve this answer











            $endgroup$









            • 1




              $begingroup$
              It is indeed atypical, but this question was prompted because I actually encountered this in the wild. I didn't generate the vcf, it was sent to me, but the header suggests it was produced by freebayes and the merge of two separate files. So this was probably an artefact of the merging. Unfortunately, I need to deal with VCF files that are given to me by clients, so while I can insist that they conform to the standards, this does conform (AFAIK), so I needed a way of fixing it.
              $endgroup$
              – terdon
              2 hours ago










            • $begingroup$
              Nice one! Unfortunately (and sorry, I should have made this clear) the actual file I had contains multi-allelic variants where only one was a dupe (see updated question). Your approach won't catch those, but bcftools does (see my answer). This works great for singles though.
              $endgroup$
              – terdon
              18 mins ago












            • $begingroup$
              Dude, that is some gnarly VCF.
              $endgroup$
              – Daniel Standage
              4 mins ago










            • $begingroup$
              Welcome to my world :(
              $endgroup$
              – terdon
              2 mins ago











            Your Answer





            StackExchange.ifUsing("editor", function () {
            return StackExchange.using("mathjaxEditing", function () {
            StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
            StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
            });
            });
            }, "mathjax-editing");

            StackExchange.ready(function() {
            var channelOptions = {
            tags: "".split(" "),
            id: "676"
            };
            initTagRenderer("".split(" "), "".split(" "), channelOptions);

            StackExchange.using("externalEditor", function() {
            // Have to fire editor after snippets, if snippets enabled
            if (StackExchange.settings.snippets.snippetsEnabled) {
            StackExchange.using("snippets", function() {
            createEditor();
            });
            }
            else {
            createEditor();
            }
            });

            function createEditor() {
            StackExchange.prepareEditor({
            heartbeatType: 'answer',
            autoActivateHeartbeat: false,
            convertImagesToLinks: false,
            noModals: true,
            showLowRepImageUploadWarning: true,
            reputationToPostImages: null,
            bindNavPrevention: true,
            postfix: "",
            imageUploader: {
            brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
            contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
            allowUrls: true
            },
            onDemand: true,
            discardSelector: ".discard-answer"
            ,immediatelyShowMarkdownHelp:true
            });


            }
            });














            draft saved

            draft discarded


















            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fbioinformatics.stackexchange.com%2fquestions%2f7126%2fhow-can-i-remove-non-trivial-duplicates-from-a-vcf-file%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown

























            2 Answers
            2






            active

            oldest

            votes








            2 Answers
            2






            active

            oldest

            votes









            active

            oldest

            votes






            active

            oldest

            votes









            2












            $begingroup$

            It turns out that bcftools can do this (tested on bcftools-1.8), if you give it the reference genome to test against:



            $ bcftools norm -d none -f hg19.fa test.vcf
            ##fileformat=VCFv4.1
            ##FILTER=<ID=PASS,Description="All filters passed">
            ##reference=foo
            ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
            ##contig=<ID=chr12>
            ##bcftools_normVersion=1.8+htslib-1.8
            ##bcftools_normCommand=norm -d none -f hg19.fa test.vcf; Date=Wed Feb 27 16:08:44 2019
            #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
            chr12 529516 . C T . PASS . GT 0/1
            Lines total/split/realigned/skipped: 2/0/1/0


            For the more complex case of the multi-allelic variant in the second VCF example from the question, you can run it through bcftools twice. Once using norm to left-align and split multi-allelic variants, and then again t remove the duplicates:



            $ bcftools norm -m -any -NO z -O v -o - ~/test2.vcf |
            bcftools norm -d none -f hg19.fa
            Lines total/split/realigned/skipped: 2/1/0/0
            ##fileformat=VCFv4.1
            ##FILTER=<ID=PASS,Description="All filters passed">
            ##reference=foo
            ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
            ##contig=<ID=chr12>
            ##bcftools_normVersion=1.8+htslib-1.8
            ##bcftools_normCommand=norm -m -any -NO z -O v -o - test2.vcf; Date=Wed Feb 27 18:18:32 2019
            ##bcftools_normCommand=norm -d none -f hg19.fa -; Date=Wed Feb 27 18:18:32 2019
            #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
            chr12 529516 529514 CAC T 0 . . GT 0/1
            chr12 529516 529514 C T 0 . . GT 0/0
            Lines total/split/realigned/skipped: 3/0/2/0





            share|improve this answer











            $endgroup$


















              2












              $begingroup$

              It turns out that bcftools can do this (tested on bcftools-1.8), if you give it the reference genome to test against:



              $ bcftools norm -d none -f hg19.fa test.vcf
              ##fileformat=VCFv4.1
              ##FILTER=<ID=PASS,Description="All filters passed">
              ##reference=foo
              ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
              ##contig=<ID=chr12>
              ##bcftools_normVersion=1.8+htslib-1.8
              ##bcftools_normCommand=norm -d none -f hg19.fa test.vcf; Date=Wed Feb 27 16:08:44 2019
              #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
              chr12 529516 . C T . PASS . GT 0/1
              Lines total/split/realigned/skipped: 2/0/1/0


              For the more complex case of the multi-allelic variant in the second VCF example from the question, you can run it through bcftools twice. Once using norm to left-align and split multi-allelic variants, and then again t remove the duplicates:



              $ bcftools norm -m -any -NO z -O v -o - ~/test2.vcf |
              bcftools norm -d none -f hg19.fa
              Lines total/split/realigned/skipped: 2/1/0/0
              ##fileformat=VCFv4.1
              ##FILTER=<ID=PASS,Description="All filters passed">
              ##reference=foo
              ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
              ##contig=<ID=chr12>
              ##bcftools_normVersion=1.8+htslib-1.8
              ##bcftools_normCommand=norm -m -any -NO z -O v -o - test2.vcf; Date=Wed Feb 27 18:18:32 2019
              ##bcftools_normCommand=norm -d none -f hg19.fa -; Date=Wed Feb 27 18:18:32 2019
              #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
              chr12 529516 529514 CAC T 0 . . GT 0/1
              chr12 529516 529514 C T 0 . . GT 0/0
              Lines total/split/realigned/skipped: 3/0/2/0





              share|improve this answer











              $endgroup$
















                2












                2








                2





                $begingroup$

                It turns out that bcftools can do this (tested on bcftools-1.8), if you give it the reference genome to test against:



                $ bcftools norm -d none -f hg19.fa test.vcf
                ##fileformat=VCFv4.1
                ##FILTER=<ID=PASS,Description="All filters passed">
                ##reference=foo
                ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
                ##contig=<ID=chr12>
                ##bcftools_normVersion=1.8+htslib-1.8
                ##bcftools_normCommand=norm -d none -f hg19.fa test.vcf; Date=Wed Feb 27 16:08:44 2019
                #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
                chr12 529516 . C T . PASS . GT 0/1
                Lines total/split/realigned/skipped: 2/0/1/0


                For the more complex case of the multi-allelic variant in the second VCF example from the question, you can run it through bcftools twice. Once using norm to left-align and split multi-allelic variants, and then again t remove the duplicates:



                $ bcftools norm -m -any -NO z -O v -o - ~/test2.vcf |
                bcftools norm -d none -f hg19.fa
                Lines total/split/realigned/skipped: 2/1/0/0
                ##fileformat=VCFv4.1
                ##FILTER=<ID=PASS,Description="All filters passed">
                ##reference=foo
                ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
                ##contig=<ID=chr12>
                ##bcftools_normVersion=1.8+htslib-1.8
                ##bcftools_normCommand=norm -m -any -NO z -O v -o - test2.vcf; Date=Wed Feb 27 18:18:32 2019
                ##bcftools_normCommand=norm -d none -f hg19.fa -; Date=Wed Feb 27 18:18:32 2019
                #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
                chr12 529516 529514 CAC T 0 . . GT 0/1
                chr12 529516 529514 C T 0 . . GT 0/0
                Lines total/split/realigned/skipped: 3/0/2/0





                share|improve this answer











                $endgroup$



                It turns out that bcftools can do this (tested on bcftools-1.8), if you give it the reference genome to test against:



                $ bcftools norm -d none -f hg19.fa test.vcf
                ##fileformat=VCFv4.1
                ##FILTER=<ID=PASS,Description="All filters passed">
                ##reference=foo
                ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
                ##contig=<ID=chr12>
                ##bcftools_normVersion=1.8+htslib-1.8
                ##bcftools_normCommand=norm -d none -f hg19.fa test.vcf; Date=Wed Feb 27 16:08:44 2019
                #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
                chr12 529516 . C T . PASS . GT 0/1
                Lines total/split/realigned/skipped: 2/0/1/0


                For the more complex case of the multi-allelic variant in the second VCF example from the question, you can run it through bcftools twice. Once using norm to left-align and split multi-allelic variants, and then again t remove the duplicates:



                $ bcftools norm -m -any -NO z -O v -o - ~/test2.vcf |
                bcftools norm -d none -f hg19.fa
                Lines total/split/realigned/skipped: 2/1/0/0
                ##fileformat=VCFv4.1
                ##FILTER=<ID=PASS,Description="All filters passed">
                ##reference=foo
                ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
                ##contig=<ID=chr12>
                ##bcftools_normVersion=1.8+htslib-1.8
                ##bcftools_normCommand=norm -m -any -NO z -O v -o - test2.vcf; Date=Wed Feb 27 18:18:32 2019
                ##bcftools_normCommand=norm -d none -f hg19.fa -; Date=Wed Feb 27 18:18:32 2019
                #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
                chr12 529516 529514 CAC T 0 . . GT 0/1
                chr12 529516 529514 C T 0 . . GT 0/0
                Lines total/split/realigned/skipped: 3/0/2/0






                share|improve this answer














                share|improve this answer



                share|improve this answer








                edited 18 mins ago

























                answered 2 hours ago









                terdonterdon

                4,3701729




                4,3701729























                    1












                    $begingroup$

                    I'm no expert with VCF (few can say they are!) but I have worked a lot with VCF data in the last few years, both tools to consume and produce VCF. I've never seen variants encoded in this fashion, and it seems to be non-canonical. Typically:




                    • Single nucleotide variants (SNVs) are encoded with a single base as the REF allele and a single base as the ALT allele.

                    • In the case of insertions or deletions, the shorter of the REF and ALT alleles will be a single base, the base preceding the inserted/deleted sequence. Thus the first base of the REF and ALT alleles is always the same.

                    • In the rarer case of two or more consecutive substitutions forming a multinucleotide variant (MNV) the REF and ALT alleles will have the same length.


                    Using multi-bp strings of the same length to encode SNVs is unnecessary and, as you've pointed out, problematic. This makes me think its a bug or a "feature" of the variant predictor that produced the VCF.



                    In this case, I'd write a small script that would check for variants where the REF and ALT alleles have the same length. If the base is the same for REF and ALT in any position, drop it, and adjust the position accordingly.



                    The script below will convert these funky SNVs to the canonical representation, and will also work on MNVs. Standard tools should then work to remove the duplicates.



                    #!/usr/bin/env python3


                    def canonicalize(instream):
                    for line in instream:
                    if not line.startswith('#'):
                    values = line.split('t')
                    pos = int(values[1])
                    ref, alt = values[3:5]
                    if len(ref) > 1 and len(ref) == len(alt):
                    # How many bp to trim off the end
                    for n, (r, a) in enumerate(zip(ref[::-1], alt[::-1])):
                    if r != a:
                    revoffset = -1 * n
                    break

                    # How many bp to trim off the front
                    for n, (r, a) in enumerate(zip(ref, alt)):
                    if r != a:
                    offset = n
                    values[1] = str(pos + offset)
                    values[3] = ref[offset:revoffset]
                    values[4] = alt[offset:revoffset]
                    break
                    line = 't'.join(values)
                    yield line


                    if __name__ == '__main__':
                    import sys
                    for line in canonicalize(sys.stdin):
                    print(line, end='')





                    share|improve this answer











                    $endgroup$









                    • 1




                      $begingroup$
                      It is indeed atypical, but this question was prompted because I actually encountered this in the wild. I didn't generate the vcf, it was sent to me, but the header suggests it was produced by freebayes and the merge of two separate files. So this was probably an artefact of the merging. Unfortunately, I need to deal with VCF files that are given to me by clients, so while I can insist that they conform to the standards, this does conform (AFAIK), so I needed a way of fixing it.
                      $endgroup$
                      – terdon
                      2 hours ago










                    • $begingroup$
                      Nice one! Unfortunately (and sorry, I should have made this clear) the actual file I had contains multi-allelic variants where only one was a dupe (see updated question). Your approach won't catch those, but bcftools does (see my answer). This works great for singles though.
                      $endgroup$
                      – terdon
                      18 mins ago












                    • $begingroup$
                      Dude, that is some gnarly VCF.
                      $endgroup$
                      – Daniel Standage
                      4 mins ago










                    • $begingroup$
                      Welcome to my world :(
                      $endgroup$
                      – terdon
                      2 mins ago
















                    1












                    $begingroup$

                    I'm no expert with VCF (few can say they are!) but I have worked a lot with VCF data in the last few years, both tools to consume and produce VCF. I've never seen variants encoded in this fashion, and it seems to be non-canonical. Typically:




                    • Single nucleotide variants (SNVs) are encoded with a single base as the REF allele and a single base as the ALT allele.

                    • In the case of insertions or deletions, the shorter of the REF and ALT alleles will be a single base, the base preceding the inserted/deleted sequence. Thus the first base of the REF and ALT alleles is always the same.

                    • In the rarer case of two or more consecutive substitutions forming a multinucleotide variant (MNV) the REF and ALT alleles will have the same length.


                    Using multi-bp strings of the same length to encode SNVs is unnecessary and, as you've pointed out, problematic. This makes me think its a bug or a "feature" of the variant predictor that produced the VCF.



                    In this case, I'd write a small script that would check for variants where the REF and ALT alleles have the same length. If the base is the same for REF and ALT in any position, drop it, and adjust the position accordingly.



                    The script below will convert these funky SNVs to the canonical representation, and will also work on MNVs. Standard tools should then work to remove the duplicates.



                    #!/usr/bin/env python3


                    def canonicalize(instream):
                    for line in instream:
                    if not line.startswith('#'):
                    values = line.split('t')
                    pos = int(values[1])
                    ref, alt = values[3:5]
                    if len(ref) > 1 and len(ref) == len(alt):
                    # How many bp to trim off the end
                    for n, (r, a) in enumerate(zip(ref[::-1], alt[::-1])):
                    if r != a:
                    revoffset = -1 * n
                    break

                    # How many bp to trim off the front
                    for n, (r, a) in enumerate(zip(ref, alt)):
                    if r != a:
                    offset = n
                    values[1] = str(pos + offset)
                    values[3] = ref[offset:revoffset]
                    values[4] = alt[offset:revoffset]
                    break
                    line = 't'.join(values)
                    yield line


                    if __name__ == '__main__':
                    import sys
                    for line in canonicalize(sys.stdin):
                    print(line, end='')





                    share|improve this answer











                    $endgroup$









                    • 1




                      $begingroup$
                      It is indeed atypical, but this question was prompted because I actually encountered this in the wild. I didn't generate the vcf, it was sent to me, but the header suggests it was produced by freebayes and the merge of two separate files. So this was probably an artefact of the merging. Unfortunately, I need to deal with VCF files that are given to me by clients, so while I can insist that they conform to the standards, this does conform (AFAIK), so I needed a way of fixing it.
                      $endgroup$
                      – terdon
                      2 hours ago










                    • $begingroup$
                      Nice one! Unfortunately (and sorry, I should have made this clear) the actual file I had contains multi-allelic variants where only one was a dupe (see updated question). Your approach won't catch those, but bcftools does (see my answer). This works great for singles though.
                      $endgroup$
                      – terdon
                      18 mins ago












                    • $begingroup$
                      Dude, that is some gnarly VCF.
                      $endgroup$
                      – Daniel Standage
                      4 mins ago










                    • $begingroup$
                      Welcome to my world :(
                      $endgroup$
                      – terdon
                      2 mins ago














                    1












                    1








                    1





                    $begingroup$

                    I'm no expert with VCF (few can say they are!) but I have worked a lot with VCF data in the last few years, both tools to consume and produce VCF. I've never seen variants encoded in this fashion, and it seems to be non-canonical. Typically:




                    • Single nucleotide variants (SNVs) are encoded with a single base as the REF allele and a single base as the ALT allele.

                    • In the case of insertions or deletions, the shorter of the REF and ALT alleles will be a single base, the base preceding the inserted/deleted sequence. Thus the first base of the REF and ALT alleles is always the same.

                    • In the rarer case of two or more consecutive substitutions forming a multinucleotide variant (MNV) the REF and ALT alleles will have the same length.


                    Using multi-bp strings of the same length to encode SNVs is unnecessary and, as you've pointed out, problematic. This makes me think its a bug or a "feature" of the variant predictor that produced the VCF.



                    In this case, I'd write a small script that would check for variants where the REF and ALT alleles have the same length. If the base is the same for REF and ALT in any position, drop it, and adjust the position accordingly.



                    The script below will convert these funky SNVs to the canonical representation, and will also work on MNVs. Standard tools should then work to remove the duplicates.



                    #!/usr/bin/env python3


                    def canonicalize(instream):
                    for line in instream:
                    if not line.startswith('#'):
                    values = line.split('t')
                    pos = int(values[1])
                    ref, alt = values[3:5]
                    if len(ref) > 1 and len(ref) == len(alt):
                    # How many bp to trim off the end
                    for n, (r, a) in enumerate(zip(ref[::-1], alt[::-1])):
                    if r != a:
                    revoffset = -1 * n
                    break

                    # How many bp to trim off the front
                    for n, (r, a) in enumerate(zip(ref, alt)):
                    if r != a:
                    offset = n
                    values[1] = str(pos + offset)
                    values[3] = ref[offset:revoffset]
                    values[4] = alt[offset:revoffset]
                    break
                    line = 't'.join(values)
                    yield line


                    if __name__ == '__main__':
                    import sys
                    for line in canonicalize(sys.stdin):
                    print(line, end='')





                    share|improve this answer











                    $endgroup$



                    I'm no expert with VCF (few can say they are!) but I have worked a lot with VCF data in the last few years, both tools to consume and produce VCF. I've never seen variants encoded in this fashion, and it seems to be non-canonical. Typically:




                    • Single nucleotide variants (SNVs) are encoded with a single base as the REF allele and a single base as the ALT allele.

                    • In the case of insertions or deletions, the shorter of the REF and ALT alleles will be a single base, the base preceding the inserted/deleted sequence. Thus the first base of the REF and ALT alleles is always the same.

                    • In the rarer case of two or more consecutive substitutions forming a multinucleotide variant (MNV) the REF and ALT alleles will have the same length.


                    Using multi-bp strings of the same length to encode SNVs is unnecessary and, as you've pointed out, problematic. This makes me think its a bug or a "feature" of the variant predictor that produced the VCF.



                    In this case, I'd write a small script that would check for variants where the REF and ALT alleles have the same length. If the base is the same for REF and ALT in any position, drop it, and adjust the position accordingly.



                    The script below will convert these funky SNVs to the canonical representation, and will also work on MNVs. Standard tools should then work to remove the duplicates.



                    #!/usr/bin/env python3


                    def canonicalize(instream):
                    for line in instream:
                    if not line.startswith('#'):
                    values = line.split('t')
                    pos = int(values[1])
                    ref, alt = values[3:5]
                    if len(ref) > 1 and len(ref) == len(alt):
                    # How many bp to trim off the end
                    for n, (r, a) in enumerate(zip(ref[::-1], alt[::-1])):
                    if r != a:
                    revoffset = -1 * n
                    break

                    # How many bp to trim off the front
                    for n, (r, a) in enumerate(zip(ref, alt)):
                    if r != a:
                    offset = n
                    values[1] = str(pos + offset)
                    values[3] = ref[offset:revoffset]
                    values[4] = alt[offset:revoffset]
                    break
                    line = 't'.join(values)
                    yield line


                    if __name__ == '__main__':
                    import sys
                    for line in canonicalize(sys.stdin):
                    print(line, end='')






                    share|improve this answer














                    share|improve this answer



                    share|improve this answer








                    edited 1 hour ago

























                    answered 2 hours ago









                    Daniel StandageDaniel Standage

                    2,333329




                    2,333329








                    • 1




                      $begingroup$
                      It is indeed atypical, but this question was prompted because I actually encountered this in the wild. I didn't generate the vcf, it was sent to me, but the header suggests it was produced by freebayes and the merge of two separate files. So this was probably an artefact of the merging. Unfortunately, I need to deal with VCF files that are given to me by clients, so while I can insist that they conform to the standards, this does conform (AFAIK), so I needed a way of fixing it.
                      $endgroup$
                      – terdon
                      2 hours ago










                    • $begingroup$
                      Nice one! Unfortunately (and sorry, I should have made this clear) the actual file I had contains multi-allelic variants where only one was a dupe (see updated question). Your approach won't catch those, but bcftools does (see my answer). This works great for singles though.
                      $endgroup$
                      – terdon
                      18 mins ago












                    • $begingroup$
                      Dude, that is some gnarly VCF.
                      $endgroup$
                      – Daniel Standage
                      4 mins ago










                    • $begingroup$
                      Welcome to my world :(
                      $endgroup$
                      – terdon
                      2 mins ago














                    • 1




                      $begingroup$
                      It is indeed atypical, but this question was prompted because I actually encountered this in the wild. I didn't generate the vcf, it was sent to me, but the header suggests it was produced by freebayes and the merge of two separate files. So this was probably an artefact of the merging. Unfortunately, I need to deal with VCF files that are given to me by clients, so while I can insist that they conform to the standards, this does conform (AFAIK), so I needed a way of fixing it.
                      $endgroup$
                      – terdon
                      2 hours ago










                    • $begingroup$
                      Nice one! Unfortunately (and sorry, I should have made this clear) the actual file I had contains multi-allelic variants where only one was a dupe (see updated question). Your approach won't catch those, but bcftools does (see my answer). This works great for singles though.
                      $endgroup$
                      – terdon
                      18 mins ago












                    • $begingroup$
                      Dude, that is some gnarly VCF.
                      $endgroup$
                      – Daniel Standage
                      4 mins ago










                    • $begingroup$
                      Welcome to my world :(
                      $endgroup$
                      – terdon
                      2 mins ago








                    1




                    1




                    $begingroup$
                    It is indeed atypical, but this question was prompted because I actually encountered this in the wild. I didn't generate the vcf, it was sent to me, but the header suggests it was produced by freebayes and the merge of two separate files. So this was probably an artefact of the merging. Unfortunately, I need to deal with VCF files that are given to me by clients, so while I can insist that they conform to the standards, this does conform (AFAIK), so I needed a way of fixing it.
                    $endgroup$
                    – terdon
                    2 hours ago




                    $begingroup$
                    It is indeed atypical, but this question was prompted because I actually encountered this in the wild. I didn't generate the vcf, it was sent to me, but the header suggests it was produced by freebayes and the merge of two separate files. So this was probably an artefact of the merging. Unfortunately, I need to deal with VCF files that are given to me by clients, so while I can insist that they conform to the standards, this does conform (AFAIK), so I needed a way of fixing it.
                    $endgroup$
                    – terdon
                    2 hours ago












                    $begingroup$
                    Nice one! Unfortunately (and sorry, I should have made this clear) the actual file I had contains multi-allelic variants where only one was a dupe (see updated question). Your approach won't catch those, but bcftools does (see my answer). This works great for singles though.
                    $endgroup$
                    – terdon
                    18 mins ago






                    $begingroup$
                    Nice one! Unfortunately (and sorry, I should have made this clear) the actual file I had contains multi-allelic variants where only one was a dupe (see updated question). Your approach won't catch those, but bcftools does (see my answer). This works great for singles though.
                    $endgroup$
                    – terdon
                    18 mins ago














                    $begingroup$
                    Dude, that is some gnarly VCF.
                    $endgroup$
                    – Daniel Standage
                    4 mins ago




                    $begingroup$
                    Dude, that is some gnarly VCF.
                    $endgroup$
                    – Daniel Standage
                    4 mins ago












                    $begingroup$
                    Welcome to my world :(
                    $endgroup$
                    – terdon
                    2 mins ago




                    $begingroup$
                    Welcome to my world :(
                    $endgroup$
                    – terdon
                    2 mins ago


















                    draft saved

                    draft discarded




















































                    Thanks for contributing an answer to Bioinformatics Stack Exchange!


                    • Please be sure to answer the question. Provide details and share your research!

                    But avoid



                    • Asking for help, clarification, or responding to other answers.

                    • Making statements based on opinion; back them up with references or personal experience.


                    Use MathJax to format equations. MathJax reference.


                    To learn more, see our tips on writing great answers.




                    draft saved


                    draft discarded














                    StackExchange.ready(
                    function () {
                    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fbioinformatics.stackexchange.com%2fquestions%2f7126%2fhow-can-i-remove-non-trivial-duplicates-from-a-vcf-file%23new-answer', 'question_page');
                    }
                    );

                    Post as a guest















                    Required, but never shown





















































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown

































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown







                    Popular posts from this blog

                    Benedict Cumberbatch Contingut Inicis Debut professional Premis Filmografia bàsica Premis i...

                    Monticle de plataforma Contingut Est de Nord Amèrica Interpretacions Altres cultures Vegeu...

                    Escacs Janus Enllaços externs Menú de navegacióEscacs JanusJanusschachBrainKing.comChessV