Adding modules

This document describes how to add post-processing modules that summarise, model or plot the results. The processing pipeline is written in Python and these steps currently use Pandas, Polars, plotnine and statsmodels, but other frameworks such as Scikit-learn (or any other Python module!) can be used to extend functionality.

You should set yourself up with a development environment as described in the contributing section so that linting and pre-commit hooks will run locally, shortening the development feedback loop.

Organising Modules

Typically the functionality you add will summarise or plot the post-processing results. The isoslam.summary and isoslam.plotting modules exist for this purpose and new functionality should normally be added to them. There is no harm in adding a new module, particularly if a large number of functions are required, but you may wish to consider adding the entry point function to one of these existing modules.

Parameters

The parameters should be clearly defined with type hints, otherwise the pre-commit hooks will fail (if not locally then in Continuous Integration, which blocks merging). The names used for parameters should be the basis of the configuration options (see next section). This makes it possible to leverage **kwargs to pass options, loaded from the configuration dictionary and updated from the command line, through to the functions.
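The link between parameter names and configuration keys can be illustrated with a minimal sketch. The function plot_conversions_sketch and the inline config dictionary are hypothetical stand-ins, not part of isoslam; the point is only that a configuration section can be unpacked with ** when its keys match the parameter names.

```python
from __future__ import annotations


def plot_conversions_sketch(group_by: str = "read", theme: str = "classic") -> str:
    """Hypothetical stand-in for a plotting function; returns a description instead of a plot."""
    return f"grouped by {group_by} with theme {theme}"


# Configuration section whose keys match the parameter names exactly
config = {"plot_conversions": {"group_by": "pair", "theme": "bw"}}

# Unpack the section so each key is passed as the keyword argument of the same name
result = plot_conversions_sketch(**config["plot_conversions"])
```

If a key in the section does not match a parameter name, Python raises a TypeError at the call, which is why keeping the two sets of names aligned matters.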

Configuration

Configuration options should be added to isoslam/default_config.yaml. A section should be defined for the module you are adding. In this worked example we are creating the plot_conversions function, so we add a section that corresponds to the arguments required for plotting. The function name is the top-level key and the options for it are nested within.

plot_conversions:
  group_by: "read"
  theme: "classic"

Validation

The isoslam.validation module checks that the parameters in default_config.yaml, a user-supplied configuration or the command line options are of the expected type. You need to add the options you have added to isoslam/default_config.yaml to the DEFAULT_CONFIG_SCHEMA that is defined in the isoslam.validation module. The existing entries there should be informative when writing new ones. The keys are the fields expected in the configuration. The values are either the expected types or a call to schema.Or(), which states the type(s) or values that are permitted and takes an error="<error message>" that is displayed if the condition is not met. For the configuration added above you would add the following to the DEFAULT_CONFIG_SCHEMA, nesting the options to reflect the configuration structure.

DEFAULT_CONFIG_SCHEMA = Schema(
    {
        "plot_conversions": {
            "group_by": Or(
                "read",
                "pair",
                error="Invalid value in config for plot_conversions.group_by, valid values are 'read' or 'pair'",
            ),
            "theme": Or(
                "classic",
                "bw",
                error="Invalid value in config for plot_conversions.theme, valid values are 'classic' or 'bw'",
            ),
        }
    }
)

Entry Points

To make the module available at the command line, and in turn possible to integrate into the CGAT pipeline, you need what is known in Python packaging as an entry point. This is a method of providing a simple command line interface to your program and its sub-modules so that they do not need prefixing with python -m. The module where this is set up is isoslam.processing, where you will see there is a create_parser() function which creates an argument parser along with sub-parsers. The new function you are adding will be added as a sub-parser. In the example below we add a sub-parser for plotting the number of conversions per read.

There should be one argument for every configuration option defined in default_config.yaml, which in turn mirrors the options used in the functions that you call, and each dest should match these names. The updating of the configuration based on command line options is contingent on these aligning.

def create_parser() -> arg.ArgumentParser:
    """
    Create a parser for reading options.

    Parser is created with multiple sub-parsers for reading options to run ``isoslam``.

    Returns
    -------
    arg.ArgumentParser
        Argument parser.
    """
    ...

    # Plot conversions per read
    plot_conversions_parser = subparsers.add_parser(
        "plot-conversions-per-read",
        description="Plot the conversions per read.",
        help="Plot the conversions per read.",
    )
    plot_conversions_parser.add_argument(
        "--file-pattern",
        dest="file_pattern",
        type=str,
        required=False,
        default="*_summarized.tsv",
        help="Regular expression for summarized files to process.",
    )
    plot_conversions_parser.add_argument(
        "--outfile",
        dest="outfile",
        type=Path,
        required=False,
        default="conversions.png",
        help="Output filename to save results to, will be nested under 'output_dir'.",
    )
    plot_conversions_parser.add_argument(
        "--separator",
        dest="sep",
        type=str,
        required=False,
        default="\t",
        help="Field separator to use in output file, default is a tab but other values (e.g. ',') are allowed.",
    )
    # Dispatch to the plot_conversions() function when this sub-command is chosen
    plot_conversions_parser.set_defaults(func=plot_conversions)

This sets up the subparser plot_conversions_parser which has three optional arguments to specify the file_pattern to be searched for and subsequently loaded, the outfile name which will be nested under the output_dir (which is an argument to the isoslam entry point) and the separator that is used in the files. Finally a default function is set, in this case plot_conversions.
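How the default function ends up being called can be seen in a small self-contained sketch. It is a simplified assumption of how dispatch might look, not the code in isoslam.processing: the placeholder plot_conversions() below only returns a message, the parser carries just two of the arguments, and the prog and sub-parser titles are illustrative.

```python
import argparse as arg
from pathlib import Path


def plot_conversions(args: arg.Namespace) -> str:
    """Placeholder for the real function; reports what it was asked to do."""
    return f"plotting {args.file_pattern} to {args.outfile}"


def create_parser() -> arg.ArgumentParser:
    """Build a parser with a single sub-parser (simplified for illustration)."""
    parser = arg.ArgumentParser(prog="isoslam")
    subparsers = parser.add_subparsers(title="program", dest="program")
    plot_conversions_parser = subparsers.add_parser("plot-conversions-per-read")
    plot_conversions_parser.add_argument(
        "--file-pattern", dest="file_pattern", type=str, default="*_summarized.tsv"
    )
    plot_conversions_parser.add_argument(
        "--outfile", dest="outfile", type=Path, default="conversions.png"
    )
    # Store the function to run on the namespace under the name ``func``
    plot_conversions_parser.set_defaults(func=plot_conversions)
    return parser


parser = create_parser()
# Simulate the user typing: isoslam plot-conversions-per-read --outfile reads.png
args = parser.parse_args(["plot-conversions-per-read", "--outfile", "reads.png"])
# The chosen sub-parser decides which function is stored in ``args.func``
message = args.func(args)
```

Because set_defaults() stores the function on the parsed namespace, the calling code does not need to know which sub-command was chosen; it simply calls args.func(args).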

The plot_conversions() function, which we will define within the processing module, does the work of calling the module you have added. The only argument it needs is args, which will be the argparse.Namespace (argparse is imported under the alias arg) created by the argument parser. These are used to update the default options, which are read from isoslam/default_config.yaml, with the values the user enters, and the result is then validated with a call to validation.validate_config().
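The reason each dest must match a configuration key can be sketched as follows. The update_config() helper is hypothetical and works on a flat dictionary; it is not the implementation of io.load_and_update_config(), only an illustration of the general idea of overriding defaults with matching command line values.

```python
import argparse as arg
from typing import Any


def update_config(config: dict[str, Any], args: arg.Namespace) -> dict[str, Any]:
    """Hypothetical helper: overwrite config values whose key matches an argument ``dest``."""
    updated = dict(config)
    for key, value in vars(args).items():
        # Ignore options the user did not supply and names absent from the configuration
        if value is not None and key in updated:
            updated[key] = value
    return updated


# Defaults as they might be read from a configuration file
defaults = {"group_by": "read", "theme": "classic"}
# Namespace as the parser might produce it: only ``group_by`` was given by the user
args = arg.Namespace(group_by="pair", theme=None, func=None)
config = update_config(defaults, args)
```

In this sketch an argument whose dest does not appear in the configuration is silently ignored, which shows why a mismatch in names means a command line option never reaches the function.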

Plotting conversions requires that we first load a series of files and combine them, so these are the first steps taken, along with some subsetting of configuration options. The data is then summarised and plotted using functions defined in, and imported from, the isoslam.summary and isoslam.plotting modules.

import argparse as arg

from loguru import logger

from isoslam import io, logging, summary, validation
from isoslam import plotting as plot


def plot_conversions(args: arg.Namespace | None) -> None:
    """
    Take a set of output files and summarise the number of conversions.

    Counts are made within file, chromosome, transcript, start, end, assignment and whether there is one or more
    conversion observed.

    Parameters
    ----------
    args : arg.Namespace | None
        Arguments function was invoked with.

    Returns
    -------
    None
        Function does not return anything.
    """
    config = io.load_and_update_config(args)
    logger.remove()
    if vars(args)["log_level"] is not None:
        logging.setup(level=vars(args)["log_level"])
    else:
        logging.setup(level=config["log_level"])
    # Validate the configuration
    validation.validate_config(
        config=config,
        schema=validation.DEFAULT_CONFIG_SCHEMA,
        config_type="configuration",
    )
    # Load and summarise the data
    plot_conversions_config = config["plot_conversions"]
    output_config = plot_conversions_config.pop("output")
    output_config["output_dir"] = config["output_dir"]
    plot.conversions(**plot_conversions_config)
    logger.info(
        f"Conversions per read plotted to : {output_config['output_dir']}/{output_config['outfile']}"
    )