src.core.analyzer package

Submodules

src.core.analyzer.adapter_trimmer module

This module contains the AdapterTrimmer class, which manages the removal of adapter sequences from sequencing reads using the Trimmomatic program.

This preprocessing step is crucial for ensuring that primer sequences are positioned close to the read ends, allowing for more accurate downstream analysis.

Classes:

AdapterTrimmer:

Responsible for executing adapter trimming on sequencing reads. Utilizes Trimmomatic, supports both single-end and paired-end modes, and logs the trimming process.

Main Features:

Validates the existence of input read files.
Constructs and executes the Trimmomatic command with appropriate parameters.
Handles output directories and logging.
Returns paths to the processed, adapter-trimmed read files.

Dependencies:

src.core.base:
Provides logging and command execution utilities.
src.core.sample_data_container:
Defines the SampleDataContainer class.
src.core.analyzer.i_data_preparator:
Interface for data preparator classes.

This class is designed to be integrated into sequencing data pipelines, automating the adapter trimming process with configurable options.

class src.core.analyzer.adapter_trimmer.AdapterTrimmer(configurator)[source]

Bases: LoggerMixin, IDataPreparator

The AdapterTrimmer class is responsible for removing adapter sequences from sequencing reads using the Trimmomatic program. This step is essential to ensure that primer sequences are positioned close to the read ends, enabling more accurate identification.

Inherits from LoggerMixin and IDataPreparator, implementing the data preparation interface.

perform(sample: SampleDataContainer, executor: CommandExecutor | callable) → list[PathLike][source]

Executes adapter sequence trimming on the provided sequencing sample.

Parameters:

sample (SampleDataContainer) – The container holding sample sequencing data, including paths to raw reads.
executor (Union[CommandExecutor, callable]) – The command executor or callable responsible for running system commands.

Returns:

A list of paths to the processed (trimmed) read files, specifically the paired read files.

Return type:

list[PathLike[AnyStr]]

Raises:

FileNotFoundError – If any of the input read files are missing.
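
The validation and command-construction steps described for perform() can be sketched roughly as follows. This is an illustration, not the AdapterTrimmer implementation: the function name build_trimmomatic_command, the jar path, the adapter FASTA name, and the output file names are all hypothetical. Only the positional layout of Trimmomatic's SE/PE modes and the ILLUMINACLIP step follow the Trimmomatic command line.

```python
from pathlib import Path


def build_trimmomatic_command(reads, out_dir, jar="trimmomatic.jar",
                              adapters="adapters.fa"):
    """Return (argv, trimmed_paths) for a Trimmomatic run (illustrative only)."""
    reads = [Path(r) for r in reads]
    if len(reads) not in (1, 2):
        raise ValueError("expected one (single-end) or two (paired-end) read files")

    # Validate inputs before building anything, mirroring the documented
    # FileNotFoundError behaviour.
    missing = [str(r) for r in reads if not r.is_file()]
    if missing:
        raise FileNotFoundError(f"missing read files: {', '.join(missing)}")

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # seed mismatches : palindrome clip threshold : simple clip threshold
    clip = f"ILLUMINACLIP:{adapters}:2:30:10"

    if len(reads) == 1:
        trimmed = [out_dir / "trimmed.fastq.gz"]
        argv = ["java", "-jar", jar, "SE", str(reads[0]), str(trimmed[0]), clip]
    else:
        paired = [out_dir / f"R{i}.paired.fastq.gz" for i in (1, 2)]
        unpaired = [out_dir / f"R{i}.unpaired.fastq.gz" for i in (1, 2)]
        # PE mode expects: in1 in2 out1_paired out1_unpaired out2_paired out2_unpaired
        argv = ["java", "-jar", jar, "PE", str(reads[0]), str(reads[1]),
                str(paired[0]), str(unpaired[0]),
                str(paired[1]), str(unpaired[1]), clip]
        trimmed = paired  # only the paired outputs are handed downstream
    return argv, trimmed
```

The returned argument list would then be passed to whatever executor the pipeline supplies; the sketch deliberately stops short of running anything.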

_abc_impl = <_abc._abc_data object>

_is_protocol = False

src.core.analyzer.amplicon_coverage_computer module

This module defines the AmpliconCoverageDataPreparator class and associated components for analyzing sequencing data coverage within specified genomic regions.

Modules and Classes:

PositionNotFoundError:
Custom exception raised when a specific genomic position cannot be found in an mpileup file.
AmpliconCoverageDataPreparator:
A class responsible for generating mpileup files for given sample regions, calculating coverage metrics, and analyzing variant coverage at specific positions. It inherits from LoggerMixin for logging capabilities and IDataPreparator to adhere to a data preparation interface.

Key functionalities:

Generating mpileup files for sample regions using samtools.
Counting coverage over regions based on mpileup data.
Counting variant coverage at specific genomic positions.
Performing the entire process of mpileup generation and coverage calculation.
Managing configuration and executing system commands for data processing.

Usage:

Instantiate the class with a configuration object and a filter function, then call perform() with a sample data container, target regions, and an executor to process coverage analysis for sequencing samples.

Note

Ensure that the configuration file contains correct paths and parameters, especially for ‘samtools’ and ‘bedtools’. The class also relies on the presence of mpileup files and the ability to generate them via command-line tools.

exception src.core.analyzer.amplicon_coverage_computer.PositionNotFoundError[source]

Bases: Exception

Exception raised when a requested genomic position cannot be located in an mpileup file.

class src.core.analyzer.amplicon_coverage_computer.AmpliconCoverageDataPreparator(configurator: Configurator, filter_func: callable)[source]

Bases: LoggerMixin, IDataPreparator

The AmpliconCoverageDataPreparator class is designed to generate mpileup files for specified regions of a sequenced sample and calculate coverage metrics within those regions. It also provides methods to count coverage over regions and analyze variant coverage at specific positions.

Inherits from LoggerMixin and IDataPreparator, ensuring logging capabilities and adherence to data preparation interface.

generate_mpileup(sample: SampleDataContainer, executor: CommandExecutor | callable) → list[PathLike][source]

Generates mpileup files for specified regions of the sample.

Parameters:

sample (SampleDataContainer) – The sample containing sequencing data.
executor (callable) – The command executor or function to run system commands.

Returns:

Paths to the generated mpileup files.

Return type:

list of PathLike
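
As a rough illustration of the mpileup generation step, the sketch below builds one samtools mpileup command per region. The helper name build_mpileup_command, the region tuple layout, and the output naming scheme are assumptions made for this example; the flags -f (reference FASTA), -r (region), and -o (output file) are standard samtools mpileup options.

```python
from pathlib import Path


def build_mpileup_command(bam, reference, region, out_dir, samtools="samtools"):
    """Return (argv, out_path) for one region; region is (chrom, start, end)."""
    chrom, start, end = region
    # Hypothetical naming scheme: one mpileup file per target region.
    out_path = Path(out_dir) / f"{chrom}_{start}_{end}.mpileup"
    argv = [
        samtools, "mpileup",
        "-f", str(reference),             # indexed reference FASTA
        "-r", f"{chrom}:{start}-{end}",   # restrict the pileup to this region
        "-o", str(out_path),              # write to a file instead of stdout
        str(bam),
    ]
    return argv, out_path
```

Calling it once per target region and collecting the second element of each result yields a list of mpileup paths of the kind the method is documented to return.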

count_region_coverage(mpileup: PathLike, chromosome: int | str, start: int | str, end: int | str) → float[source]

Counts the coverage within a specified region from an mpileup file.

Parameters:

mpileup (PathLike) – Path to the mpileup file.
chromosome (int or str) – Chromosome identifier.
start (int or str) – Start position of the region.
end (int or str) – End position of the region.

Returns:

The filtered average coverage within the region.

Return type:

float
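
One plausible reading of "filtered average coverage" is the mean depth over the positions in the region whose depth passes the filter function given to the constructor. The sketch below works on that assumption. The name region_coverage_sketch and the filter signature (depth in, bool out) are hypothetical; the column order (chromosome, position, reference base, depth) is that of samtools mpileup output.

```python
from statistics import mean


def region_coverage_sketch(lines, chromosome, start, end,
                           filter_func=lambda depth: True):
    """Mean depth over [start, end] on one chromosome, after filtering."""
    depths = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4 or fields[0] != str(chromosome):
            continue  # malformed line or a different chromosome
        if int(start) <= int(fields[1]) <= int(end):
            depth = int(fields[3])
            if filter_func(depth):
                depths.append(depth)
    # An empty selection is reported as 0.0 rather than raising.
    return float(mean(depths)) if depths else 0.0
```

For example, passing filter_func=lambda d: d >= 10 would average only positions covered by at least ten reads.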

static count_indels(data: str) → dict[str, int][source]

Counts the number of insertions and deletions for two replicates (r1 and r2) based on the input data string.

Parameters:

data (str) – A string containing insertion and deletion patterns in the form ‘+<number><bases>’ or ‘-<number><bases>’, where <bases> is a sequence of [ACTGNactgn] characters.

Returns:

A dictionary mapping each indel signature (key) to the number of times it occurs in the pileup line (value).

Return type:

dict[str, int]
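
The pattern format described above can be parsed with a short scanner. The following is a sketch under stated assumptions, not the module's code: the name count_indels_sketch is hypothetical, signatures are kept case-sensitive, and the scanner does not special-case the '^' read-start marker, whose following quality character may itself be '+' or '-'.

```python
import re
from collections import Counter

# Matches the sign and length prefix of an indel, e.g. "+2" or "-11".
_INDEL_HEAD = re.compile(r"[+-](\d+)")


def count_indels_sketch(data: str) -> dict[str, int]:
    """Count indel signatures such as '+2AT' or '-1c' in a pileup base string."""
    counts = Counter()
    pos = 0
    while True:
        match = _INDEL_HEAD.search(data, pos)
        if match is None:
            break
        length = int(match.group(1))
        # The declared length says how many bases belong to this indel.
        bases = data[match.end():match.end() + length]
        counts[match.group(0) + bases] += 1
        # Resume after the indel bases so they are not rescanned.
        pos = match.end() + length
    return dict(counts)
```

Reading the length first and then slicing exactly that many characters avoids the usual pitfall of a greedy base-matching regex swallowing the mismatch bases that follow the indel.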

static count_target_char(src: AnyStr, target_char: AnyStr = '*') → int[source]

count_variant_coverage(chromosome: int | str, position: int | str, ref: str, alt: str) → tuple[int, int, float][source]

Calculates coverage information and variant counts at a specific genomic position from mpileup data.

Parameters:

chromosome (int or str) – The chromosome identifier (number or string).
position (int or str) – The genomic position to analyze.
ref (str) – The reference allele at the position.
alt (str) – The alternative allele at the position.

Returns:

depth (int) – The total read depth at the position.
total_alt_count (int) – The total count of reads supporting the alternative allele, including indels.
alt_ratio (float) – The ratio of reads supporting the alternative allele to total depth, rounded to 3 decimal places.

Return type:

tuple

Note

This method searches for the specified position in a chromosome-specific mpileup file.
It uses memory-mapped file access for efficiency.
It counts reference matches (‘.’ and ‘,’) and mismatches (based on alt allele).
It also calls count_indels() to count insertions and deletions supporting the variant.
Returns (-1, -1, -1) if the position is not found or an error occurs.
Raises FileNotFoundError if the mpileup file for the chromosome does not exist.
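
The memory-mapped lookup described in the note might look roughly like the sketch below. It is a simplification made for illustration: the name snv_coverage_sketch is hypothetical, only single-base substitutions are counted, and bases embedded in indel strings are not excluded, whereas the documented method also accounts for indels via count_indels().

```python
import mmap


def snv_coverage_sketch(mpileup_path, position, alt):
    """Return (depth, alt_count, alt_ratio), or (-1, -1, -1) if not found."""
    # open() raises FileNotFoundError on a missing file, as documented.
    with open(mpileup_path, "rb") as handle:
        # Map the file read-only; lines are then read without loading it whole.
        with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as view:
            for raw in iter(view.readline, b""):
                fields = raw.decode().rstrip("\n").split("\t")
                if len(fields) < 5 or fields[1] != str(position):
                    continue
                depth = int(fields[3])
                # Naive, case-insensitive count of the alt base in column 5.
                alt_count = fields[4].upper().count(alt.upper())
                ratio = round(alt_count / depth, 3) if depth else 0.0
                return depth, alt_count, ratio
    return -1, -1, -1
```

Note that mmap cannot map an empty file, so a production version would need to guard against zero-length mpileup output.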

perform(sample: SampleDataContainer, executor: CommandExecutor | callable) → list[source]

Executes the process of generating mpileup files for target regions and calculates coverage metrics.

Parameters:

sample (SampleDataContainer) – The sequencing data sample.
executor (callable) – Function or command executor to run system commands.

Returns:

Results containing coverage metrics for each region.

Return type:

list

_abc_impl = <_abc._abc_data object>

_is_protocol = False

src.core.analyzer.annotation_adapter module

This module defines interfaces and implementations for variant annotation adapters, specifically using SnpEff.

It provides a protocol interface for annotation adapters and a concrete implementation that uses SnpEff to annotate variant data in VCF files.

Classes:

IAnnotationAdapter:
Protocol interface for annotation adapters.
SnpEffAnnotationAdapter:
Implements variant annotation using SnpEff.

Main Features:

Annotates VCF files with variant effect predictions.
Generates summary reports in HTML and CSV formats.
Supports integration with command execution frameworks.

class src.core.analyzer.annotation_adapter.IAnnotationAdapter(*args, **kwargs)[source]

Bases: Protocol

Interface for annotation adapters that perform variant annotation on sequencing data.

annotate(sample: SampleDataContainer, reference_ident: str, executor: CommandExecutor | callable) → PathLike[source]

Performs annotation on the given sample’s variant data.

Parameters:

sample (SampleDataContainer) – The sample containing variant data (e.g., VCF file).
reference_ident (str) – Identifier for the reference genome or annotation database.
executor (callable or CommandExecutor) – Function or object to execute system commands.

Returns:

Path to the annotated variant file (e.g., annotated VCF).

Return type:

PathLike

_abc_impl = <_abc._abc_data object>

_is_protocol = True

class src.core.analyzer.annotation_adapter.SnpEffAnnotationAdapter(configurator: Configurator)[source]

Bases: LoggerMixin, IAnnotationAdapter

Implementation of the IAnnotationAdapter interface using SnpEff for variant annotation.

annotate(sample: SampleDataContainer, reference_ident: str, executor: CommandExecutor | callable) → PathLike[source]

Annotates variants in the sample’s VCF file using SnpEff.

Parameters:

sample (SampleDataContainer) – The sample with VCF data to annotate.
reference_ident (str) – The reference genome or database identifier.
executor (callable) – Function or object to execute system commands.

Returns:

Path to the annotated VCF file.

Return type:

PathLike
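
To make the annotation step concrete, here is a sketch of how a SnpEff invocation with HTML and CSV summaries could be assembled. The helper name build_snpeff_command, the jar location, and the output file names are assumptions for this example. The ann command and the -csvStats and -htmlStats options are part of the SnpEff command line, and SnpEff writes the annotated VCF to standard output, so the caller is expected to redirect it.

```python
from pathlib import Path


def build_snpeff_command(vcf, reference_ident, out_dir, jar="snpEff.jar"):
    """Return (argv, annotated_path); stdout must be redirected to annotated_path."""
    vcf = Path(vcf)
    out_dir = Path(out_dir)
    # Hypothetical naming: reuse the input stem for all outputs.
    annotated = out_dir / f"{vcf.stem}.annotated.vcf"
    argv = [
        "java", "-jar", jar, "ann",
        "-csvStats", str(out_dir / f"{vcf.stem}.snpeff.csv"),
        "-htmlStats", str(out_dir / f"{vcf.stem}.snpeff.html"),
        reference_ident,   # SnpEff database name, e.g. a genome build identifier
        str(vcf),
    ]
    return argv, annotated
```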

_abc_impl = <_abc._abc_data object>

_is_protocol = False

src.core.analyzer.bam_grouper module

This module contains the BamGrouper class, which manages the conversion, sorting, and indexing of SAM files into BAM format for sequencing data.

Classes:

BamGrouper:
Converts SAM files to sorted BAM files, adds read group information, and indexes the BAM files using Picard tools.

Main Functionality:

Takes a sample’s SAM file output from mapping.
Uses Picard tools to add read groups, sort the BAM file, and create an index.
Produces space-efficient, indexed BAM files optimized for downstream analysis and fast interaction.

This class is designed to streamline BAM file preparation steps in sequencing pipelines, improving efficiency and facilitating downstream processing.

class src.core.analyzer.bam_grouper.BamGrouper(configurator: Configurator)[source]

Bases: LoggerMixin, IDataPreparator

The BamGrouper class handles the conversion of SAM files to sorted BAM files, adds read group information, and indexes the BAM files using Picard tools.

BAM files are more space-efficient, faster for interaction, and support indexing.

perform(sample: SampleDataContainer, executor: CommandExecutor | callable) → tuple[PathLike, PathLike][source]

Converts the read mapping output (SAM file) to a BAM file, sorts the reads, adds read group information, and indexes the BAM file using Picard.

BAM files occupy less disk space, and thanks to indexing and their binary format they are significantly faster to work with.

Parameters:

sample (SampleDataContainer) – The container with the sample’s data; may be used for naming or metadata.
executor (Union[CommandExecutor, callable]) – An external callable or a dedicated class that handles and/or wraps system calls.

Returns:

A pair of paths (index_path, bam_path), where index_path is the path to the BAM index file (.bai) and bam_path is the path to the sorted BAM file.

Return type:

tuple
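
Picard can perform conversion, read group assignment, sorting, and indexing in a single AddOrReplaceReadGroups call, which is one way the step described above could be realised. The sketch below assumes that approach. The helper name and the read group values (library lib1, platform ILLUMINA) are placeholders, while the argument names themselves are Picard's.

```python
def build_read_group_command(sam, bam, sample_id, jar="picard.jar"):
    """Return argv for Picard AddOrReplaceReadGroups (illustrative values)."""
    return [
        "java", "-jar", jar, "AddOrReplaceReadGroups",
        f"I={sam}",                # input SAM from the aligner
        f"O={bam}",                # output BAM
        f"RGID={sample_id}",       # read group fields; placeholder values
        "RGLB=lib1",
        "RGPL=ILLUMINA",
        f"RGPU={sample_id}",
        f"RGSM={sample_id}",
        "SORT_ORDER=coordinate",   # sort while writing
        "CREATE_INDEX=true",       # emit a .bai index next to the BAM
    ]
```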

_abc_impl = <_abc._abc_data object>

_is_protocol = False

src.core.analyzer.bqsr_performer module

This module contains the BQSRPerformer class, which manages Base Quality Score Recalibration (BQSR) using GATK’s BaseRecalibrator and ApplyBQSR tools.

It performs the following key steps:

1. Generates a recalibration table with BaseRecalibrator.
2. Applies the recalibration to produce a recalibrated BAM file with ApplyBQSR.

The process enhances variant calling accuracy by adjusting quality scores based on known sites and covariates, improving downstream analyses.

Classes:

BQSRPerformer:
Executes BQSR by running GATK commands, managing logs, and handling input/output files.

Main Features:

Constructs command-line strings for GATK tools.
Executes commands with logging and error handling.
Handles input sample data and target regions.
Renames output files post-processing.

class src.core.analyzer.bqsr_performer.BQSRPerformer(configurator: Configurator)[source]

Bases: LoggerMixin, IDataPreparator

Handles Base Quality Score Recalibration (BQSR) using GATK’s tools.

Performs two main steps:

1. Generates a recalibration table with BaseRecalibrator.
2. Applies the recalibration with ApplyBQSR to produce a corrected BAM file.

This process improves the accuracy of variant calling by adjusting quality scores based on known sites and covariates.

perform(sample: SampleDataContainer, executor: CommandExecutor | callable) → PathLike[source]

Executes BQSR using GATK’s BaseRecalibrator and ApplyBQSR.

Parameters:

sample (SampleDataContainer) – The sample data to process.
executor (Union[CommandExecutor, callable]) – Function or object to run commands.

Returns:

Path to the recalibrated BAM file.

Return type:

PathLike[AnyStr]

Raises:

Propagates exceptions from command execution or file operations.
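
The two-step flow can be sketched as a pair of GATK4 commands run in order. The helper names, the default gatk launcher, and the decision to pass intervals only to BaseRecalibrator are assumptions of this example; the option names (-I, -R, -O, -L, --known-sites, --bqsr-recal-file) are GATK4's.

```python
def build_bqsr_commands(bam, reference, known_sites, table, out_bam,
                        intervals=None, gatk="gatk"):
    """Return (recalibrate_argv, apply_argv) for GATK4 BQSR."""
    recal = [gatk, "BaseRecalibrator",
             "-I", bam, "-R", reference,
             "--known-sites", known_sites,
             "-O", table]
    if intervals:
        recal += ["-L", intervals]  # restrict the model to target regions
    apply = [gatk, "ApplyBQSR",
             "-I", bam, "-R", reference,
             "--bqsr-recal-file", table,
             "-O", out_bam]
    return recal, apply


def run_bqsr(commands, executor):
    """Run the commands in order; any exception from the executor propagates."""
    for argv in commands:
        executor(argv)
```

The second command consumes the table written by the first, so the order matters and a failure in step one should stop the run, which is why the sketch lets exceptions propagate.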

_abc_impl = <_abc._abc_data object>

_is_protocol = False

src.core.analyzer.i_data_preparator module

Module containing the IDataPreparator protocol.

This module defines a protocol for data preparator classes. This allows type hinting to ensure that classes using an IDataPreparator will have a perform method with a specific signature.

class src.core.analyzer.i_data_preparator.IDataPreparator(*args, **kwargs)[source]

Bases: Protocol

Protocol interface for data preparator classes.

Defines a method ‘perform’ that all implementing classes must override. This allows type hinting to ensure that classes using an IDataPreparator will have a perform method with a specific signature.

perform(*args, **kwargs) → Any[source]

Performs the data preparation steps.

Raises:

NotImplementedError – If the method is not implemented in a concrete class.
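
For readers unfamiliar with typing.Protocol, the sketch below shows the general idea with stand-in names (DataPreparatorSketch, UppercasePreparator, and run_step are all hypothetical). The runtime_checkable decorator is an addition made here so the example can demonstrate an isinstance check; whether IDataPreparator uses it is not stated in this documentation.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class DataPreparatorSketch(Protocol):
    """Stand-in for a preparator protocol with a single perform() method."""

    def perform(self, *args: Any, **kwargs: Any) -> Any:
        ...


class UppercasePreparator:
    """Satisfies the protocol structurally; no inheritance is required."""

    def perform(self, text: str) -> str:
        return text.upper()


def run_step(step: DataPreparatorSketch, *args: Any) -> Any:
    # A type checker verifies that `step` has a compatible perform().
    return step.perform(*args)
```

The classes documented in this package list the protocol among their bases explicitly, which also works and makes the intent visible in the class definition.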

_abc_impl = <_abc._abc_data object>

_is_protocol = True

src.core.analyzer.i_variant_caller module

Module containing protocols for variant calling and data preparation.

This module defines protocols for IDataPreparator and IVariantCaller classes, enforcing a specific method signature for type hinting. This improves code maintainability and readability by ensuring all implementing classes adhere to a consistent interface.

class src.core.analyzer.i_variant_caller.IVariantCaller(*args, **kwargs)[source]

Bases: Protocol

Protocol interface for variant caller classes. Defines the ‘call_variant’ method that all implementing classes must override, responsible for performing variant calling on a given sample.

call_variant(sample: SampleDataContainer, executor: CommandExecutor | callable) → Any[source]

Executes variant calling on the provided sample.

Parameters:

sample (SampleDataContainer) – The sample data container containing input data.
executor (Union[CommandExecutor, callable]) – Function or object to execute system commands.

Returns:

This method performs its task without returning a value.

Return type:

None

_abc_impl = <_abc._abc_data object>

_is_protocol = True

src.core.analyzer.primer_cutter module

This module contains classes and functions for preparing sequencing data by executing primer trimming operations.

Classes:

CutPrimers:
Handles execution of external primer cutting scripts on sequencing samples.
PTrimmer:
Performs primer sequence trimming from paired-end reads.
PrimerCutter:
Factory class for creating instances of primer-related data preparators based on specified cutter type.

Purpose:

This module facilitates data preprocessing steps essential for sequencing analysis pipelines, such as trimming primer sequences and cutting primers based on external scripts, while maintaining detailed logs of operations.

class src.core.analyzer.primer_cutter.CutPrimers(configurator: Configurator)[source]

Bases: LoggerMixin, IDataPreparator

Class responsible for executing primer cutting on sequencing data. Runs an external primer cutting script with specified parameters and logs progress.

perform(sample: SampleDataContainer, executor: CommandExecutor | callable) → tuple[PathLike, PathLike][source]

Executes the primer cutting process on the provided sample data.

This method constructs a command to run an external primer cutting script with the specified parameters, manages logging setup, and runs the command using the provided executor. It generates trimmed and untrimmed file paths, logs the execution details, and returns the paths to the trimmed read files.

Parameters:

sample (SampleDataContainer) – The sample data containing source file paths and processing directories.
executor (Union[CommandExecutor, callable]) – An executor object or function responsible for running the command.

Returns:

Paths to the trimmed R1 and R2 files.

Return type:

Tuple[PathLike[AnyStr], PathLike[AnyStr]]

_abc_impl = <_abc._abc_data object>

_is_protocol = False

class src.core.analyzer.primer_cutter.PTrimmer(configurator: Configurator)[source]

Bases: LoggerMixin, IDataPreparator

Class responsible for trimming primer sequences from paired-end reads.

It runs an external trimming tool and logs progress.

perform(sample: SampleDataContainer, executor: CommandExecutor | callable) → tuple[PathLike, PathLike][source]

Performs primer trimming on the sample’s read files.

Parameters:

sample (SampleDataContainer) – The sample data with source file paths.
executor (CommandExecutor or callable) – Executor for running commands.

Returns:

Tuple of paths to the trimmed R1 and R2 files.

_abc_impl = <_abc._abc_data object>

_is_protocol = False

class src.core.analyzer.primer_cutter.PrimerCutter(configurator: Configurator, logger: Logger | None = None)[source]

Bases: LoggerMixin

Factory class for creating primer-related data preparator instances. Provides a method to instantiate specific primer cutter classes based on name.

static create_primer_cutter(configurator: Configurator, cutter_name: str | None = 'cutprimers') → IDataPreparator[source]

Factory method to instantiate a primer cutter object based on the cutter_name.

Parameters:

configurator (Configurator) – Configuration object with parameters and logger.
cutter_name (str) – Name of the cutter type (‘cutprimers’ or ‘ptrimmer’).

Returns:

IDataPreparator instance corresponding to the cutter.

src.core.analyzer.sequence_aligner module

This module defines the SequenceAligner class, responsible for mapping sequencing reads to a reference genome using an aligner such as BWA-MEM2.

It handles the construction and execution of alignment commands, logging the process, and managing output files.

Classes:

SequenceAligner:
Performs read alignment to a reference genome, logs the process, and returns the path to the aligned reads file.

Main Features:

Constructs command-line instructions for BWA-MEM2.
Ensures log directories exist.
Handles sample information and reference genome input.
Manages output paths for alignment results.
Implements error handling with logging.

class src.core.analyzer.sequence_aligner.SequenceAligner(configurator)[source]

Bases: LoggerMixin, IDataPreparator

Class responsible for mapping sequencing reads to a reference genome. Utilizes an aligner like BWA-MEM2 to perform the mapping and logs the process.

perform(sample: SampleDataContainer, reference_source: PathLike, executor: CommandExecutor | callable) → PathLike[source]

Maps reads to the reference human genome.

This is the stage at which, for each read, it is determined where a similar sequence is located in the reference genome, and their alignment is performed relative to each other.

Parameters:

sample (SampleDataContainer) – The container holding sample’s sequencing data, including raw reads path.
reference_source (PathLike[AnyStr]) – Path to the reference genome file to which reads will be aligned.
executor (Union[CommandExecutor, callable]) – An external callable or a dedicated class that handles and/or wraps system calls.

Returns:

Path to the mapped reads file.

Return type:

PathLike[AnyStr]
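
The executor parameter accepts any callable, so a thin wrapper around subprocess is one possible choice. The sketch below pairs such a wrapper with a bwa-mem2 command builder. Both function names, the default thread count, and the executor's two-argument signature are assumptions of this example; bwa-mem2 mem writes SAM to standard output, hence the redirection to a file.

```python
import subprocess
from pathlib import Path


def build_bwa_mem2_command(reads, reference, threads=4, aligner="bwa-mem2"):
    """Return argv for `bwa-mem2 mem`; alignments go to stdout."""
    return [aligner, "mem", "-t", str(threads), str(reference),
            *map(str, reads)]


def subprocess_executor(argv, stdout_path):
    """Hypothetical executor: run argv and capture stdout into a file."""
    with open(stdout_path, "wb") as sink:
        # check=True turns a non-zero exit status into CalledProcessError.
        subprocess.run(argv, stdout=sink, check=True)
    return Path(stdout_path)
```

A call such as subprocess_executor(build_bwa_mem2_command([r1, r2], ref), "sample.sam") would then produce the SAM file that later steps convert to BAM.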

_abc_impl = <_abc._abc_data object>

_is_protocol = False

class src.core.analyzer.sequence_aligner.BWAAligner(configurator)[source]

Bases: LoggerMixin, IDataPreparator

Class responsible for mapping sequencing reads to a reference genome. Utilizes an aligner like BWA-MEM2 to perform the mapping and logs the process.

perform(sample: SampleDataContainer, reference_source: PathLike, executor: CommandExecutor | callable) → PathLike[source]

Maps reads to the reference human genome.

This is the stage at which, for each read, it is determined where a similar sequence is located in the reference genome, and their alignment is performed relative to each other.

Parameters:

sample (SampleDataContainer) – The container holding sample’s sequencing data, including raw reads path.
reference_source (PathLike[AnyStr]) – Path to the reference genome file to which reads will be aligned.
executor (Union[CommandExecutor, callable]) – An external callable or a dedicated class that handles and/or wraps system calls.

Returns:

Path to the mapped reads file.

Return type:

PathLike[AnyStr]

_abc_impl = <_abc._abc_data object>

_is_protocol = False

src.core.analyzer.variant_caller module

This module contains classes for variant calling in genomic data analysis.

It defines a base class VariantCaller and specific implementations for different variant calling tools such as Pisces, GATK’s UnifiedGenotyper, and FreeBayes.

The classes provide methods to execute variant calling commands, handle logging, and manage configurations.

The design promotes modularity and extensibility, allowing easy integration of additional variant callers by subclassing VariantCaller.

The use of a configurator object ensures flexible parameter management across different tools.

Usage:

Instantiate the specific variant caller class with the configuration.
Call the call_variant() method with a sample data container and executor to perform variant calling.

Note

The UnifiedGenotyperVariantCaller is deprecated; consider updating to newer GATK tools.

class src.core.analyzer.variant_caller.VariantCaller(configurator: Configurator, logger: Logger | None = None)[source]

Bases: LoggerMixin, IVariantCaller

Base class for variant callers.

Provides a common interface and shared functionality for specific variant caller implementations. Manages configuration and logging setup.

configurator

Configuration object containing parameters and paths.

Type: Configurator

call_variant(sample: SampleDataContainer, executor: CommandExecutor | callable)[source]: Method to perform variant calling. To be implemented in subclasses.

_abc_impl = <_abc._abc_data object>

_is_protocol = False

class src.core.analyzer.variant_caller.PiscesVariantCaller(configurator: Configurator, logger: Logger | None = None)[source]

Bases: VariantCaller

Variant caller implementation using Pisces.

Executes the Pisces command-line tool for variant calling on a given sample.

call_variant(sample, executor)[source]: Performs variant calling and returns output VCF path.

call_variant(sample: SampleDataContainer, executor: CommandExecutor | callable)[source]

Executes variant calling using Pisces.

Parameters:

sample (SampleDataContainer) – Sample information including BAM path.
executor (Union[CommandExecutor, callable]) – Command executor.

Returns:

Path to the output VCF file.

Return type:

str

_abc_impl = <_abc._abc_data object>

_is_protocol = False

class src.core.analyzer.variant_caller.UnifiedGenotyperVariantCaller(configurator: Configurator, logger: Logger | None = None)[source]

Bases: VariantCaller

Deprecated GATK UnifiedGenotyper variant caller.

Issues a warning indicating deprecation. Intended for use with older GATK versions.

call_variant(sample, executor)[source]: Placeholder with warning; does not perform actual calling.

call_variant(sample: SampleDataContainer, executor: CommandExecutor | callable) → None[source]: Method to perform variant calling. To be implemented in subclasses.

_abc_impl = <_abc._abc_data object>

_is_protocol = False

class src.core.analyzer.variant_caller.FreebayesVariantCaller(configurator: Configurator, logger: Logger | None = None)[source]

Bases: VariantCaller

Variant caller implementation using FreeBayes.

Executes the FreeBayes command-line tool for variant calling on a given sample.

call_variant(sample, executor)[source]: Performs variant calling.

call_variant(sample: SampleDataContainer, executor: CommandExecutor | callable)[source]

Executes variant calling using FreeBayes.

Parameters:

sample (SampleDataContainer) – Sample information including BAM and VCF paths.
executor (Union[CommandExecutor, callable]) – Command executor.

_abc_impl = <_abc._abc_data object>

_is_protocol = False

src.core.analyzer.variant_caller_factory module

This module defines a factory class for creating variant caller instances. It provides a way to select a specific variant calling tool (e.g., Pisces, GATK UnifiedGenotyper, FreeBayes) based on configuration settings. The factory ensures that the correct variant caller is instantiated and initialized with the appropriate configuration parameters.

class src.core.analyzer.variant_caller_factory.VariantCallerFactory(logger: Logger | None = None)[source]

Bases: LoggerMixin

Factory class for creating variant caller instances based on configuration.

Supports multiple variant calling tools such as: Pisces, GATK UnifiedGenotyper, and FreeBayes.

The factory uses the provided configuration to determine which variant caller to instantiate.

This promotes code modularity and allows for easy addition or removal of variant calling tools without affecting other parts of the application.

static create_caller(caller_config: dict[str, str], configurator: Configurator) → IVariantCaller[source]

Creates an instance of a variant caller based on the provided configuration.

Parameters:

caller_config (Dict[str, str]) – Dictionary containing at least the ‘name’ key specifying the caller type.
configurator (Configurator) – Configuration object containing parameters and logger.

Returns:

An instance of the selected variant caller class.

Return type:

IVariantCaller

Raises:

ConfigurationError – If the caller type is not recognized.
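
A name-to-class registry is a common way to implement a factory of this kind. The sketch below illustrates it with stand-in classes: the registry keys ('pisces', 'freebayes'), the local ConfigurationError definition, and the stub caller classes are all assumptions, since the accepted names and the exception's module are not given here.

```python
class ConfigurationError(Exception):
    """Local stand-in for the project's ConfigurationError."""


class _CallerStub:
    def __init__(self, configurator):
        self.configurator = configurator


class PiscesStub(_CallerStub):
    pass


class FreebayesStub(_CallerStub):
    pass


# Hypothetical registry; the real factory may key on different names.
_CALLERS = {"pisces": PiscesStub, "freebayes": FreebayesStub}


def create_caller_sketch(caller_config, configurator):
    """Instantiate the caller named by caller_config['name']."""
    name = str(caller_config.get("name", "")).lower()
    try:
        caller_cls = _CALLERS[name]
    except KeyError:
        known = ", ".join(sorted(_CALLERS))
        raise ConfigurationError(
            f"unknown variant caller {name!r}; expected one of: {known}"
        ) from None
    return caller_cls(configurator)
```

Adding a new caller then means adding one registry entry, which matches the modularity goal stated in the class description.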
