Using input function with remote files in snakemake

8/7/2019

I want to use an input function to read input file paths from a dataframe and pass them to my snakemake rule. I also have a helper function that selects the remote from which to pull the files.

from snakemake.remote.GS import RemoteProvider as GSRemoteProvider
from snakemake.remote.SFTP import RemoteProvider as SFTPRemoteProvider
from os.path import join
import pandas as pd

configfile: "config.yaml"
units = pd.read_csv(config["units"]).set_index(["library", "unit"], drop=False)
TMP = join('data', 'tmp')


def access_remote(local_path):
    """Connects to the remote defined in the config file."""
    provider = config['provider']
    if provider == 'GS':
        GS = GSRemoteProvider()
        remote_path = GS.remote(join("gs://" + config['bucket'], local_path))
    elif provider == 'SFTP':
        SFTP = SFTPRemoteProvider(
            username=config['user'],
            private_key=config['ssh_key']
        )
        remote_path = SFTP.remote(
            config['host'] + ":22" + join(base_path, local_path)
        )
    else: 
        remote_path = local_path
    return remote_path


def get_fastqs(wc):
    """
    Get fastq files (units) of a particular library / library type
    combination from the unit sheet.
    """
    fqs = units.loc[
        (units.library == wc.library) &
        (units.libtype == wc.libtype),
        "fq1"
    ]
    # fqs is already the "fq1" column (a Series), so use its values directly
    return {
      "r1": list(map(access_remote, fqs.values)),
    }

# Combine all fastq files from the same sample / library type combination
rule combine_units:
  input: unpack(get_fastqs)
  output:
    r1 = join(TMP, "reads", "{library}_{libtype}.end1.fq.gz")
  threads: 12
  run:
    shell("cat {i1} > {o1}".format(i1=input['r1'], o1=output['r1']))

My config file contains the bucket name and provider, which the helper function reads. This works as expected when I simply run snakemake.

However, I would like to use the Kubernetes integration, which requires passing the provider and bucket name on the command line. But when I run:

snakemake -n --kubernetes --default-remote-provider GS --default-remote-prefix bucket-name

I get this error:

ERROR :: MissingInputException in line 19 of Snakefile:
Missing input files for rule combine_units:
bucket-name/['bucket-name/lib1-unit1.end1.fastq.gz', 'bucket-name/lib1-unit2.end1.fastq.gz', 'bucket-name/lib1-unit3.end1.fastq.gz']

The bucket prefix is applied twice: once correctly to each element, and once in front of the whole list (which gets converted to a string). Did I miss something? Is there a good way to work around this?
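To make the symptom concrete, here is a minimal plain-Python sketch (no snakemake involved) that reproduces the shape of the path in the error message. The helper `apply_prefix` is hypothetical and only stands in for "prepend the bucket name to a path"; it is not snakemake's actual implementation, just an illustration of what happens when an already-prefixed list is treated as a single path.

```python
import posixpath


def apply_prefix(prefix, path):
    """Hypothetical stand-in: prepend a bucket name to a single path."""
    return posixpath.join(prefix, path)


files = ["lib1-unit1.end1.fastq.gz", "lib1-unit2.end1.fastq.gz"]

# Expected behaviour: the prefix is applied once, to each element.
per_element = [apply_prefix("bucket-name", f) for f in files]

# Observed symptom: the whole (already prefixed) list is handled as one
# path, so it is converted to its string representation and prefixed again.
as_one_path = apply_prefix("bucket-name", str(per_element))

print(per_element)
print(as_one_path)
```

The second value has the same form as the path reported by MissingInputException, which suggests that somewhere the list returned for "r1" is being handled as a single file name after its elements were already prefixed.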

-- cmdoret
google-cloud-storage
kubernetes
snakemake

0 Answers