Variants not inserted from input_variants (VCF) file provided #117

Open
alyssa-ab opened this issue Jun 27, 2024 · 14 comments
Assignees
Labels
bug Something isn't working enhancement New feature or request

Comments

@alyssa-ab

Describe the bug
When trying to simulate desired insertions from a VCF file, the insertions are not present in the resulting reads. The golden VCF file is completely empty after a successful run (no mutations are added, including the directed insertions). Work being done with @SnehaGummadi

To Reproduce
Using version 4.2.1 main branch
Running on Linux HPC using provided conda environment from install documentation.

Command: neat --log-level DEBUG read-simulator -c testing.yml -o asdf

  • debug log 1719503380.1076086_NEAT 1.log

  • Log never indicates success in "Reading input_VCF: path to vcf"

  • config yaml (full paths were provided for reference and input_variants but have been shortened for privacy)

reference: chr18_smallest.fa
read_len: 101
coverage: 3
error_model: .
avg_seq_error: 0.0
rescale_qualities: .
quality_offset: .
ploidy: 1
input_variants: repeats_testing.vcf
target_bed: .
off_target_scalar: .
discard_bed: .
mutation_model: .
mutation_rate: 0.00
mutation_bed: .
paired_ended: True
fragment_model: .
fragment_mean: 300
fragment_st_dev: 30
produce_bam: .
produce_vcf: True
produce_fastq: True
no_coverage_bias: .
rng_seed: 1
min_mutations: .
overwrite_output: .
  • The run was stopped before covering the dataset; it did not fail here
  • The same behavior is seen when running to completion
  • No error message is provided, but the insertions are not simulated

Expected behavior
We're expecting to be able to see the inserted variants (without additional mutations) in the simulated reads. For example, a string of 10 A's might be identifiable in some reads. We also expect to see the insertions from our provided VCF appear in the golden VCF.

Additional context

  • An attempt was made to hardcode the input_VCF path into options.py which resulted in a slew of other errors that can be addressed if needed.
  • This was done in hopes of forcing the program to read in the VCF
  • Based on the code in options.py, it seems that self.include_vcf remains None rather than being set to the path of the VCF file, which might explain why the log file never shows the VCF file being read.
@joshfactorial
Collaborator

joshfactorial commented Jun 27, 2024 via email

@joshfactorial
Collaborator

Please check the latest release to ensure your issues are resolved. You can reopen this ticket if they persist.

@dani-ture

I think the issue might be in your testing.yml config file. The option where you include the input vcf is actually called "include_vcf" and not "input_variants".
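
A renamed or misspelled key like this is easy to miss because the run still completes without an error. As a minimal sketch (not NEAT's own code), a flat `key: value` config can be checked against a set of expected keys before running. The key set below is an assumption: it is assembled from the config posted in this thread, with `input_variants` replaced by `include_vcf`. The helper name `find_unknown_keys` is hypothetical.

```python
# Assumed key set: taken from the config posted in this thread, with
# "input_variants" replaced by "include_vcf". This is not NEAT's official list.
KNOWN_KEYS = {
    "reference", "read_len", "coverage", "error_model", "avg_seq_error",
    "rescale_qualities", "quality_offset", "ploidy", "include_vcf",
    "target_bed", "off_target_scalar", "discard_bed", "mutation_model",
    "mutation_rate", "mutation_bed", "paired_ended", "fragment_model",
    "fragment_mean", "fragment_st_dev", "produce_bam", "produce_vcf",
    "produce_fastq", "no_coverage_bias", "rng_seed", "min_mutations",
    "overwrite_output",
}


def find_unknown_keys(config_text, known_keys=KNOWN_KEYS):
    """Return the keys of a flat 'key: value' config that are not in known_keys."""
    unknown = []
    for line in config_text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not 'key: value'.
        if not line or line.startswith("#") or ":" not in line:
            continue
        key = line.split(":", 1)[0].strip()
        if key not in known_keys:
            unknown.append(key)
    return unknown


example = """reference: chr18_smallest.fa
read_len: 101
input_variants: repeats_testing.vcf
"""
print(find_unknown_keys(example))  # ['input_variants']
```

Run against the config from the original report, a check like this would flag `input_variants` up front instead of letting the VCF be ignored silently.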

@joshfactorial joshfactorial reopened this Jul 2, 2024
@joshfactorial
Collaborator

You're right. I must have changed that at some point and forgot to update the configuration file. I will do that!

@joshfactorial
Collaborator

It was incorrect in one version of the configuration, and correct in the other.

@joshfactorial
Collaborator

@alyssa-ab please try changing the variable in the config to "include_vcf" and see if that works. I'll update the template file.

@alyssa-ab
Author

Thanks for your help! We will give this as well as the new version a try and let you know how it goes.

@SnehaGummadi

I made the change in the yml file as mentioned. The vcf is being read, but I still get the error message below. The repeats_testing.vcf file contains 1 large insertion.

2024-07-11 10:59:37,930:INFO:neat.read_simulator.utils.vcf_func:Parsing input vcf /{base_dir}/NEAT-chimeric/reference_files/repeats_testing.vcf
2024-07-11 10:59:39,378:INFO:neat.read_simulator.utils.vcf_func:Found 1 variants in input VCF.
2024-07-11 10:59:39,379:INFO:neat.read_simulator.utils.vcf_func:Skipped 0 variants because of multiples at the same location
2024-07-11 10:59:41,318:INFO:neat.read_simulator.runner:Beginning simulation.
2024-07-11 10:59:42,027:INFO:neat.read_simulator.runner:Generating variants for chr18
2024-07-11 10:59:42,786:INFO:neat.read_simulator.utils.generate_variants:Finished generating random mutations in 0.00 minutes
2024-07-11 10:59:42,803:INFO:neat.read_simulator.utils.generate_variants:Added 0 mutations to chr18
2024-07-11 10:59:42,804:INFO:neat.read_simulator.utils.generate_reads:Sampling reads...
2024-07-11 11:09:25,575:ERROR:neat:read-simulator failed, see the traceback below
Traceback (most recent call last):
  File "{base_dir}/NEAT-chimeric/neat/cli/cli.py", line 131, in main
    cmd(args)
  File "{base_dir}/NEAT-chimeric/neat/cli/commands/read_simulator.py", line 47, in execute
    read_simulator_runner(arguments.config, arguments.output)
  File "{base_dir}/NEAT-chimeric/neat/read_simulator/runner.py", line 295, in read_simulator_runner
    generate_reads(local_reference,
  File "{base_dir}/NEAT-chimeric/neat/read_simulator/utils/generate_reads.py", line 569, in generate_reads
    if sent_to_chimeric == False:
  File "{base_dir}/NEAT-chimeric/neat/read_simulator/utils/read.py", line 339, in finalize_read_and_write
    self.apply_variants_for_final_output(err_model, mut_model)
  File "{base_dir}/NEAT-chimeric/neat/read_simulator/utils/read.py", line 260, in apply_variants_for_final_output
    self.apply_mutations(list(error_model.quality_scores), mutation_model)
  File "{base_dir}/NEAT-chimeric/neat/read_simulator/utils/read.py", line 223, in apply_mutations
    reference_length = variant_to_apply.get_ref_len()
  File "{base_dir}/NEAT-chimeric/neat/variants/unknown_variant.py", line 62, in get_ref_len
    return len(self.metadata['REF'])
KeyError: 'REF'
ERROR: read-simulator failed, showing the last error
Traceback (most recent call last):
  File "{base_dir}/NEAT-chimeric/neat/cli/cli.py", line 131, in main
    cmd(args)
  File "{base_dir}//NEAT-chimeric/neat/cli/commands/read_simulator.py", line 47, in execute
    read_simulator_runner(arguments.config, arguments.output)
  File "{base_dir}/NEAT-chimeric/neat/read_simulator/runner.py", line 295, in read_simulator_runner
    generate_reads(local_reference,
  File "{base_dir}/NEAT-chimeric/neat/read_simulator/utils/generate_reads.py", line 569, in generate_reads
    if sent_to_chimeric == False:
  File "{base_dir}/NEAT-chimeric/neat/read_simulator/utils/read.py", line 339, in finalize_read_and_write
    self.apply_variants_for_final_output(err_model, mut_model)
  File "{base_dir}/NEAT-chimeric/neat/read_simulator/utils/read.py", line 260, in apply_variants_for_final_output
    self.apply_mutations(list(error_model.quality_scores), mutation_model)
  File "{base_dir}/NEAT-chimeric/neat/read_simulator/utils/read.py", line 223, in apply_mutations
    reference_length = variant_to_apply.get_ref_len()
  File "{base_dir}/NEAT-chimeric/neat/variants/unknown_variant.py", line 62, in get_ref_len
    return len(self.metadata['REF'])
KeyError: 'REF'

I tried having 2 large insertions in the vcf file, but the log reports only 1 of the 2 variants as found. It also failed again with the same error message as above.

2024-07-11 11:15:02,569:INFO:neat.read_simulator.utils.vcf_func:Parsing input vcf /{base_dir}/NEAT-chimeric/reference_files/repeats_testing.vcf
2024-07-11 11:15:05,132:INFO:neat.read_simulator.utils.vcf_func:Found 1 variants in input VCF.
2024-07-11 11:15:05,145:INFO:neat.read_simulator.utils.vcf_func:Skipped 0 variants because of multiples at the same location
2024-07-11 11:15:06,900:INFO:neat.read_simulator.runner:Beginning simulation

@joshfactorial joshfactorial self-assigned this Jul 11, 2024
@joshfactorial joshfactorial added the bug Something isn't working label Jul 11, 2024
@joshfactorial
Collaborator

Okay, that looks like a bug with how it is storing the metadata. I will work on this. Did the original vcf have a ref and alt? It looks to me like the code didn't detect the ref properly. Would it be possible to share the insertions so I can try them directly, or at least an example line with dummy data? I'm wondering if there's a file format reason for this that I didn't take into account.

@SnehaGummadi

reference fasta: https://raw.githubusercontent.com/SnehaGummadi/NEAT-chimeric/4.2_dev_chimeric_reads/reference_files/chr18_smallest.fa

input_vcf: https://raw.githubusercontent.com/SnehaGummadi/NEAT-chimeric/4.2_dev_chimeric_reads/reference_files/line_x1_HERVK_x1.vcf

I do want to note that when the REF allele was incorrect for the second insertion, the program threw an error. This vcf file should have the fixed version.
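
For anyone preparing a similar input file, a REF mismatch of this kind can be caught before running the simulator. The sketch below is not part of NEAT; `read_fasta` and `find_ref_mismatches` are hypothetical helpers, and the toy FASTA and VCF text are made up for illustration (they are not the files linked above). It relies only on the VCF convention that POS is 1-based and that REF must equal the reference bases starting at that position.

```python
def read_fasta(text):
    """Parse FASTA text into {contig_name: sequence}."""
    sequences = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Contig name is the first word after '>'.
            name = line[1:].split()[0]
            sequences[name] = []
        elif name is not None:
            sequences[name].append(line.upper())
    return {contig: "".join(parts) for contig, parts in sequences.items()}


def find_ref_mismatches(vcf_text, reference):
    """Return (chrom, pos, ref_in_vcf, ref_in_fasta) for records whose REF disagrees."""
    mismatches = []
    for line in vcf_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 5:
            continue  # not a complete VCF record
        chrom, pos, ref = fields[0], int(fields[1]), fields[3].upper()
        start = pos - 1  # VCF POS is 1-based; Python slices are 0-based
        expected = reference.get(chrom, "")[start:start + len(ref)]
        if expected != ref:
            mismatches.append((chrom, pos, ref, expected))
    return mismatches


# Made-up toy data: the contig sequence is ACGTACGTTT.
fasta = ">chr18 toy\nACGTAC\nGTTT\n"
vcf = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr18\t3\t.\tG\tGAAAAAAAAAA\t.\tPASS\t.\n"
    "chr18\t5\t.\tT\tTCC\t.\tPASS\t.\n"
)
print(find_ref_mismatches(vcf, read_fasta(fasta)))  # [('chr18', 5, 'T', 'A')]
```

The first toy record passes because base 3 of the contig is G; the second is reported because base 5 is A, not T.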

@joshfactorial
Collaborator

joshfactorial commented Jul 12, 2024 via email

@joshfactorial
Collaborator

All right, so the bug I found was in how it was counting how many variants it found. Once I fixed that and cleared the "reference Mismatch" one, it read two properly.

I'm not sure, however, that this variant will work right with NEAT as is. I didn't really consider variants that were longer than a read length. It may get inserted at least in part in reads that overlap its start position, but the rest of it will probably not appear anywhere. But this is on our list of things to work on for future development.
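
Until that is supported, it may help to know in advance which records in an input VCF fall into this category. The following is a rough sketch, not NEAT functionality: `long_insertions` is a hypothetical helper that assumes ALT alleles are written as explicit sequences (symbolic alleles such as `<INS>` are not handled) and compares the number of inserted bases against the `read_len` from the config (101 in the original report).

```python
def long_insertions(vcf_text, read_len):
    """Return (chrom, pos, inserted_bases) for insertions longer than read_len."""
    flagged = []
    for line in vcf_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 5:
            continue  # not a complete VCF record
        chrom, pos, ref = fields[0], int(fields[1]), fields[3]
        # ALT may hold several comma-separated alleles.
        for alt in fields[4].split(","):
            inserted = len(alt) - len(ref)
            if inserted > read_len:
                flagged.append((chrom, pos, inserted))
    return flagged


# Made-up records: one 150-base insertion and one 10-base insertion.
toy_vcf = (
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr18\t3\t.\tG\tG" + "A" * 150 + "\t.\tPASS\t.\n"
    "chr18\t9\t.\tT\tT" + "A" * 10 + "\t.\tPASS\t.\n"
)
print(long_insertions(toy_vcf, read_len=101))  # [('chr18', 3, 150)]
```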

@joshfactorial
Collaborator

I will push the messaging changes and hopefully get a new PR in the next day or two.

@joshfactorial joshfactorial added the enhancement New feature or request label Jul 13, 2024
@SnehaGummadi

Alright, thank you!
