Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some changes on finemapper.py with data parsing, and if unnecessary, i'll only submit the code with bug fix for bgen file #188

Open
wants to merge 12 commits into
base: master
Choose a base branch
from
36 changes: 36 additions & 0 deletions PR_XTF_readme.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,36 @@
## Pull Request Title: Enhancements and New Features for Genetic Data Processing
Code only modified for `finemapper.py`
### Summary of Changes

This PR introduces several updates and new features to improve the handling and processing of genetic data. Key changes include:

1. **Support for BGEN to cal LD and save to .npz Conversion to accelate the loading speed compared to .bcor:**

- Implemented functionality for converting BGEN files directly to LD matrices stored in .npz format, aligning with polyFun standards. Use `--geno genofile --ldstore2 $(which ldstore) --cache-dir ./ --cache-format npz` to save in npz format by default.
2. **NPZ File Reading Capability:**

- Added capability to read npz files using `--ld your_npz_prefix`.
3. **PGEN File Support:**
- This may be slower than ldstore2 did!
- Introduced support for using pgen files as input via `--geno pgen_file_prefix`.
- Integrated `finemap_tools` for invoking Plink2 (version must be later than PLINK v2.00a6LM 64-bit Intel, dated 2 Mar 2024) to compute LD, using the command template: `plink2 --r2-unphased square`.
- **Note:** The `--geno` option matches files using prefixes, with bed files having higher priority over pgen to avoid conflicts when both file types are present.
5. **Improvements in LD Matrix Handling:**

- Removed the assertion that LD matrices must not contain any NaN values during reading. Instead, i modified `sync_ld_sumstats` function to exclude SNPs with NA values.
- Addressed cases where an SNP's LD calculation might result in NaN due to unfiltered genotype data or extreme scenarios. Users are advised to consider stricter QC or automatically exclude SNPs with NA in any LD calculation based on the number of SNPs dropped, as indicated by the output.
5. **Enhancements in Summary Statistics (sumstats) Loading:**

- For sumstats files that are bgz compressed and have an associated .tbi file, implemented reading via the `tabix` command-line tool. This approach is particularly efficient for genome-wide sumstats, allowing direct retrieval of data by chromosome, significantly reducing loading times.
- In scenarios where `finemap_tools` is unavailable, the original logic of reading the entire file will be followed.
- Sumstats files organized according to polyFun requirements (columns: SNP, A1, A2, BP, CHR, with SNP in the first column) can be processed using `tabix -s 2 -b 3 -e 3 -c S sumstats_with_bgz_compressed.bgz`.
6. **Integration of `finemap_tools` Package:**

- Included `finemap_tools` for filtering bialleic and ambiguous alleles during sumstats reading.
7. **Code Formatting Updates:**

- Applied code formatting improvements using Black.

### Future Developments

Further development and updates will continue in my own repository and will not be submitted as pull requests to this project.
Loading