Confused by `--default-remote-prefix` flag in snakemake for kubernetes on Google Cloud

9/6/2019

I'm trying to run a bioinformatics pipeline using Snakemake on Google Cloud. The first two steps are:

  1. Download the reads from ENA
  2. Run bbmap's clumpify on the data.

The two rules look like this:

def get_fwd_url(wildcard):
    return samples.loc[wildcard, 'fwd'].values[0]

def get_rev_url(wildcard):
    return samples.loc[wildcard, 'rev'].values[0]

rule get_reads:
    output:
        fwd=temp("samples/{sample}/fwd.gz"),
        rev=temp("samples/{sample}/rev.gz")
    threads: 1
    params:
        fwd_url=get_fwd_url,
        rev_url=get_rev_url
    log:
        "logs/{sample}.get_reads.log"
    benchmark:
        "benchmarks/{sample}.get_reads.tsv"
    shell:
        """
        wget -O {output.fwd} {params.fwd_url};
        wget -O {output.rev} {params.rev_url};
        """

rule run_bbmap_clumpify:
    input:
        raw_fwd=rules.get_reads.output.fwd,
        raw_rev=rules.get_reads.output.rev
    output:
        temp("{sample}.clumped.fq.gz")
    threads: 32
    resources:
        mem_mb=15000
    conda:
        "../envs/conda_qc_reads.yml"
    log:
        "logs/{sample}.run_bbmap_clumpify.log"
    benchmark:
        "benchmarks/{sample}.run_bbmap_clumpify.tsv"
    group: "bbtools"
    shell:
        """
            clumpify.sh -Xmx104g -eoom -da in1={input.raw_fwd} in2={input.raw_rev} out={output} dedupe optical 2>&1 | tee {log}
        """

When I run it locally using `snakemake -np` (a dry run), it successfully builds the DAG:

rule get_reads:
    output: samples/196_SRF/fwd.gz, samples/196_SRF/rev.gz
    log: logs/196_SRF.get_reads.log
    jobid: 9
    benchmark: benchmarks/196_SRF.get_reads.tsv
    wildcards: sample=196_SRF


        wget -O samples/196_SRF/fwd.gz ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR276/ERR2762138/BNA_AAXOSW_4_1_C7T1BACXX.IND15_clean.fastq.gz;
        wget -O samples/196_SRF/rev.gz ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR276/ERR2762138/BNA_AAXOSW_4_2_C7T1BACXX.IND15_clean.fastq.gz;

rule run_bbmap_clumpify:
    input: samples/196_SRF/fwd.gz, samples/196_SRF/rev.gz
    output: 196_SRF.clumped.fq.gz
    log: logs/196_SRF.run_bbmap_clumpify.log
    jobid: 8
    benchmark: benchmarks/196_SRF.run_bbmap_clumpify.tsv
    wildcards: sample=196_SRF
    resources: mem_mb=15000


            clumpify.sh -Xmx104g -eoom -da in1=samples/196_SRF/fwd.gz in2=samples/196_SRF/rev.gz out=196_SRF.clumped.fq.gz dedupe optical 2>&1 | tee logs/196_SRF.run_bbmap_clumpify.log

I want to leverage Google Cloud to do this analysis, so I set up a GS bucket called temperton-lab-wec-store, then ran:

snakemake -p --kubernetes \
--use-conda -j 12 \
--default-remote-provider GS \
--default-remote-prefix temperton-lab-wec-store --dryrun
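My understanding of `--default-remote-prefix` was that it simply prepends the bucket name once to each local path in the workflow, something like this (my own sketch, not Snakemake's actual implementation):

```python
# What I expected --default-remote-prefix to do: prefix each local
# path with the bucket name exactly once.
prefix = "temperton-lab-wec-store"
local_paths = ["samples/196_SRF/fwd.gz", "196_SRF.clumped.fq.gz"]
remote_paths = [f"{prefix}/{p}" for p in local_paths]
assert remote_paths == [
    "temperton-lab-wec-store/samples/196_SRF/fwd.gz",
    "temperton-lab-wec-store/196_SRF.clumped.fq.gz",
]
```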

Building the DAG fails because the default remote prefix gets repeated over and over again in the path:

Building DAG of jobs...
MissingInputException in line 25 of snakemake/rules/qc_reads.smk:
Missing input files for rule run_bbmap_clumpify:
temperton-lab-wec-store/temperton-lab-wec-store/samples/temperton-lab-wec-store/temperton-lab-wec-store/temperton-lab-wec-store/temperton-lab-wec-store/temperton-lab-wec-store/temperton-lab-wec-store/196_SRF/rev.gz
temperton-lab-wec-store/temperton-lab-wec-store/samples/temperton-lab-wec-store/temperton-lab-wec-store/temperton-lab-wec-store/temperton-lab-wec-store/temperton-lab-wec-store/temperton-lab-wec-store/196_SRF/fwd.gz

I presume I am doing something wrong, either in naming the outputs in the rules or in setting the `--default-remote-prefix` flag. However, I can't find anything in the documentation that indicates how to fix it.

Any ideas?

-- Ben Temperton
google-kubernetes-engine
snakemake

2 Answers

9/9/2019

I found an example here [1] where the Snakefile is modified to use the reference and FASTQ files from Google Cloud Storage.

Take a look at the example and let me know if trying that fixes the problem.

Here [2] is another example of using the `--default-remote-prefix` flag. I see nothing wrong with the way you're running the snakemake command, so some setting in the rules file might be the problem.

Move input files to the cloud (from Google Cloud Storage)

Store output files on the cloud
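If the linked examples follow the usual pattern, they wrap the remote files explicitly via the GS remote provider instead of relying on the flag — roughly along these lines (untested sketch; the bucket name and paths are taken from your question):

```
from snakemake.remote.GS import RemoteProvider as GSRemoteProvider
GS = GSRemoteProvider()

rule get_reads:
    output:
        fwd=GS.remote("temperton-lab-wec-store/samples/{sample}/fwd.gz"),
        rev=GS.remote("temperton-lab-wec-store/samples/{sample}/rev.gz")
```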

-- Annie
Source: StackOverflow

9/6/2019

I surmise this is a case where the wildcard {sample} matches more than it should (see https://groups.google.com/forum/#!msg/snakemake/wVlJW9X-9EU/gSZh4U0_CQAJ, although the error there is different).

Assuming you don't need to interpret the values in {sample} as regular expressions, add something like this before the first rule (rule all, or whatever you called it):

import re

wildcard_constraints:
    sample='|'.join([re.escape(x) for x in SAMPLES])  # where SAMPLES is your list of samples

rule all:
    etc...
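To see why the constraint helps: by default a wildcard matches the regex `.+`, which also matches `/`, so `{sample}` can absorb whole directory components such as the remote prefix. A quick check, with a hypothetical sample list:

```python
import re

# Snakemake's default wildcard regex is '.+', which swallows '/' and
# therefore directory components like the remote prefix.
default = re.compile(r".+")
assert default.fullmatch("temperton-lab-wec-store/196_SRF")

# Constraining the wildcard to exact sample names stops that.
SAMPLES = ["196_SRF", "197_DCM"]  # hypothetical sample list
constrained = re.compile("|".join(re.escape(x) for x in SAMPLES))
assert constrained.fullmatch("196_SRF")
assert not constrained.fullmatch("temperton-lab-wec-store/196_SRF")
```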
-- dariober
Source: StackOverflow