I'm trying to run a bioinformatics pipeline using Snakemake on Google Cloud. The first two steps are to download the reads and then run bbmap's clumpify on the data. The two rules look like this:
def get_fwd_url(wildcard):
    return samples.loc[wildcard, 'fwd'].values[0]

def get_rev_url(wildcard):
    return samples.loc[wildcard, 'rev'].values[0]

rule get_reads:
    output:
        fwd=temp("samples/{sample}/fwd.gz"),
        rev=temp("samples/{sample}/rev.gz")
    threads: 1
    params:
        fwd_url=get_fwd_url,
        rev_url=get_rev_url
    log:
        "logs/{sample}.get_reads.log"
    benchmark:
        "benchmarks/{sample}.get_reads.tsv"
    shell:
        """
        wget -O {output.fwd} {params.fwd_url};
        wget -O {output.rev} {params.rev_url};
        """

rule run_bbmap_clumpify:
    input:
        raw_fwd=rules.get_reads.output.fwd,
        raw_rev=rules.get_reads.output.rev
    output:
        temp("{sample}.clumped.fq.gz")
    threads: 32
    resources:
        mem_mb=15000
    conda:
        "../envs/conda_qc_reads.yml"
    log:
        "logs/{sample}.run_bbmap_clumpify.log"
    benchmark:
        "benchmarks/{sample}.run_bbmap_clumpify.tsv"
    group: "bbtools"
    shell:
        """
        clumpify.sh -Xmx104g -eoom -da in1={input.raw_fwd} in2={input.raw_rev} out={output} dedupe optical 2>&1 | tee {log}
        """
When I run it locally using snakemake -p --dryrun, it successfully builds the DAG:
rule get_reads:
    output: samples/196_SRF/fwd.gz, samples/196_SRF/rev.gz
    log: logs/196_SRF.get_reads.log
    jobid: 9
    benchmark: benchmarks/196_SRF.get_reads.tsv
    wildcards: sample=196_SRF

wget -O samples/196_SRF/fwd.gz ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR276/ERR2762138/BNA_AAXOSW_4_1_C7T1BACXX.IND15_clean.fastq.gz;
wget -O samples/196_SRF/rev.gz ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR276/ERR2762138/BNA_AAXOSW_4_2_C7T1BACXX.IND15_clean.fastq.gz;

rule run_bbmap_clumpify:
    input: samples/196_SRF/fwd.gz, samples/196_SRF/rev.gz
    output: 196_SRF.clumped.fq.gz
    log: logs/196_SRF.run_bbmap_clumpify.log
    jobid: 8
    benchmark: benchmarks/196_SRF.run_bbmap_clumpify.tsv
    wildcards: sample=196_SRF
    resources: mem_mb=15000

clumpify.sh -Xmx104g -eoom -da in1=samples/196_SRF/fwd.gz in2=samples/196_SRF/rev.gz out=196_SRF.clumped.fq.gz dedupe optical 2>&1 | tee logs/196_SRF.run_bbmap_clumpify.log
I want to leverage Google Cloud to do this analysis, so I set up a GS bucket called temperton-lab-wec-store, then ran:
snakemake -p --kubernetes \
--use-conda -j 12 \
--default-remote-provider GS \
--default-remote-prefix temperton-lab-wec-store --dryrun
Building the DAG fails because the default remote prefix gets repeated over and over again in the path:
Building DAG of jobs...
MissingInputException in line 25 of snakemake/rules/qc_reads.smk:
Missing input files for rule run_bbmap_clumpify:
temperton-lab-wec-store/temperton-lab-wec-store/samples/temperton-lab-wec-store/temperton-lab-wec-store/temperton-lab-wec-store/temperton-lab-wec-store/temperton-lab-wec-store/temperton-lab-wec-store/196_SRF/rev.gz
temperton-lab-wec-store/temperton-lab-wec-store/samples/temperton-lab-wec-store/temperton-lab-wec-store/temperton-lab-wec-store/temperton-lab-wec-store/temperton-lab-wec-store/temperton-lab-wec-store/196_SRF/fwd.gz
I presume I am doing something wrong either in naming the outputs in the rules or in setting the --default-remote-prefix flag. However, I can't find anything in the documentation that indicates how I might fix it.
Any ideas?
I found here [1] an example where the Snakefile is modified to use the reference and FASTQ files from Google Cloud Storage. Take a look at the example and let me know if trying that fixes the problem.
Here [2] is another example using the --default-remote-prefix flag. I see nothing wrong in the way you're running the snakemake command, but some settings in the rules file might be the problem.
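If it helps, the approach in the first example amounts to wrapping bucket paths in the GS remote provider instead of relying on --default-remote-prefix. A minimal sketch of that idea (pre-v8 Snakemake API; only the bucket name is taken from your post, the rest of the paths are illustrative):

from snakemake.remote.GS import RemoteProvider as GSRemoteProvider

GS = GSRemoteProvider()

rule run_bbmap_clumpify:
    input:
        # GS.remote() marks these paths as objects in the bucket, so no
        # default prefix is prepended to them.
        raw_fwd=GS.remote("temperton-lab-wec-store/samples/{sample}/fwd.gz"),
        raw_rev=GS.remote("temperton-lab-wec-store/samples/{sample}/rev.gz")
    output:
        GS.remote("temperton-lab-wec-store/{sample}.clumped.fq.gz")
    shell:
        "clumpify.sh in1={input.raw_fwd} in2={input.raw_rev} out={output} dedupe optical"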
I surmise this is a case where the wildcard {sample} matches more than it should (see https://groups.google.com/forum/#!msg/snakemake/wVlJW9X-9EU/gSZh4U0_CQAJ, although the error there is different). Assuming you don't need to interpret the values in {sample} as regular expressions, add something like this before the first rule (rule all, or whatever you called it):
import re

wildcard_constraints:
    sample='|'.join([re.escape(x) for x in SAMPLES]),  # where SAMPLES is your list of samples

rule all:
    etc...
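For context on why this should help: by default a Snakemake wildcard matches the regex .+, which also matches path separators, so once --default-remote-prefix prepends the bucket name, {sample} is free to match strings that themselves contain temperton-lab-wec-store/, and the prefix stacks up as in your error. Constraining the wildcard to the literal sample names rules that out. Assuming a pandas sample sheet like the one sketched in the question, SAMPLES could be derived like this:

# Hypothetical: the allowed wildcard values are exactly the sample names
# in the sheet, e.g. ['196_SRF', ...].
SAMPLES = samples.index.tolist()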