CountsQC (opennano.qc)

The CountsQC class performs quality control (QC) checks on NanoString GeoMx data.

class opennano.qc.CountsQC(adata=None, dcc_directory=None, pkc_file=None, metadata_file=None, minSegmentReads=1000, percentTrimmed=80, percentStitched=80, percentAligned=75, percentSaturation=50, minNegativeCount=10, maxNTCCount=9000, minNuclei=20, minArea=1000, negative_probe_cutoff=1.1)

Bases: object

A class for performing quality control (QC) checks on an AnnData object.

__init__(adata=None, dcc_directory=None, pkc_file=None, metadata_file=None, minSegmentReads=1000, percentTrimmed=80, percentStitched=80, percentAligned=75, percentSaturation=50, minNegativeCount=10, maxNTCCount=9000, minNuclei=20, minArea=1000, negative_probe_cutoff=1.1)

Initialize the CountsQC class for performing quality control on GeoMx data.

This constructor initializes a CountsQC object either from an existing AnnData object or by processing .dcc, .pkc, and metadata files to generate an AnnData object. It also sets quality control thresholds for various metrics.

Parameters:
  • adata (AnnData, optional) – An existing AnnData object to initialize the QC process. If not provided, the dcc_directory, pkc_file, and metadata_file parameters must be specified to create the AnnData object.

  • dcc_directory (str, optional) – Path to the directory containing .dcc files. Required if adata is not provided.

  • pkc_file (str, optional) – Path to the .pkc file. Required if adata is not provided.

  • metadata_file (str, optional) – Path to the GEO SOFT metadata file. Required if adata is not provided.

  • minSegmentReads (int, default=1000) – Minimum number of reads required for a segment to pass QC.

  • percentTrimmed (int, default=80) – Minimum percentage of trimmed reads required for a segment to pass QC.

  • percentStitched (int, default=80) – Minimum percentage of stitched reads required for a segment to pass QC.

  • percentAligned (int, default=75) – Minimum percentage of aligned reads required for a segment to pass QC.

  • percentSaturation (int, default=50) – Minimum sequencing saturation percentage required for a segment to pass QC.

  • minNegativeCount (int, default=10) – Minimum count of negative probes required for a segment to pass QC.

  • maxNTCCount (int, default=9000) – Maximum count for no-template control (NTC) probes allowed for a segment to pass QC.

  • minNuclei (int, default=20) – Minimum number of nuclei required for a segment to pass QC.

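  • minArea (int, default=1000) – Minimum area (in pixels or other units) required for a segment to pass QC.

  • negative_probe_cutoff (float, default=1.1) – Cutoff for filtering genes based on their background ratios compared to negative probes.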

Raises:

ValueError – If adata is not provided and any of dcc_directory, pkc_file, or metadata_file is missing.
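The input rule behind this error can be sketched as follows. This is a minimal illustration, not the library's code: validate_inputs is a hypothetical helper name, the error message is invented, and the assumption that adata takes precedence when both adata and file paths are given is not confirmed by the source.

```python
def validate_inputs(adata=None, dcc_directory=None, pkc_file=None, metadata_file=None):
    """Hypothetical helper mirroring the documented constructor rule."""
    if adata is not None:
        # Assumption: an existing AnnData object wins if both are supplied.
        return "adata"
    required = {
        "dcc_directory": dcc_directory,
        "pkc_file": pkc_file,
        "metadata_file": metadata_file,
    }
    # Collect the names of all file arguments that were left unset.
    missing = [name for name, value in required.items() if value is None]
    if missing:
        raise ValueError("adata was not provided; missing: " + ", ".join(missing))
    return "files"


try:
    validate_inputs(dcc_directory="dcc/", pkc_file="panel.pkc")
except ValueError as err:
    print(err)
```

Listing every missing argument in one message saves the caller from fixing them one at a time.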

Notes

  • If adata is not provided, the class processes the GeoMx data from the provided files (dcc_directory, pkc_file, and metadata_file) using the GeoMxProcessor class.

  • The initialized object contains metadata and quality control thresholds that can be used for running QC checks and generating filtered datasets.

adata

The AnnData object containing the GeoMx data, either provided or created during initialization.

Type:

AnnData

df

DataFrame representation of the expression matrix from the AnnData object.

Type:

pandas.DataFrame

roi_metadata

List of regions of interest (ROIs) from the unstructured metadata in the AnnData object.

Type:

list

neg_probe_indices

Indices of negative probes in the AnnData object.

Type:

pandas.Index

passed_rois

List of ROIs that pass all QC checks, initialized as empty.

Type:

list

calc_min_area(adata, idx)

Retrieve the minimum area for a specific ROI.

Parameters:
  • adata (AnnData) – The AnnData object containing expression and metadata.

  • idx (str) – The key of the ROI in the unstructured data (uns) of the AnnData object.

Returns:

The minimum area for the ROI.

Return type:

int

calc_min_nuclei(adata, idx)

Retrieve the minimum nuclei count for a specific ROI.

Parameters:
  • adata (AnnData) – The AnnData object containing expression and metadata.

  • idx (str) – The key of the ROI in the unstructured data (uns) of the AnnData object.

Returns:

The minimum nuclei count for the ROI.

Return type:

int

calc_percent_aligned(adata, idx)

Calculate the percentage of aligned reads for a specific ROI.

Parameters:
  • adata (AnnData) – The AnnData object containing expression and metadata.

  • idx (str) – The key of the ROI in the unstructured data (uns) of the AnnData object.

Returns:

The percentage of reads that were aligned.

Return type:

float

calc_percent_saturation(adata, idx)

Calculate the sequencing saturation for a specific ROI.

Parameters:
  • adata (AnnData) – The AnnData object containing expression and metadata.

  • idx (str) – The key of the ROI in the unstructured data (uns) of the AnnData object.

Returns:

The sequencing saturation value for the ROI.

Return type:

float
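The source does not state how saturation is computed. A common definition of sequencing saturation is the share of aligned reads that are duplicates, sketched below. The function name and the aligned and deduplicated inputs are assumptions for illustration and are not part of the documented API.

```python
def percent_saturation(aligned, deduplicated):
    """Share of aligned reads that are duplicates, as a percentage.

    Assumed definition: 100 * (1 - deduplicated / aligned).
    """
    if aligned == 0:
        return 0.0  # avoid division by zero for empty segments
    return 100.0 * (1.0 - deduplicated / aligned)


saturation = percent_saturation(aligned=10_000, deduplicated=4_000)
# Compare against the documented default threshold (percentSaturation=50).
print(f"{saturation:.1f}% saturation, passes: {saturation >= 50}")
```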

calc_percent_stitched(adata, idx)

Calculate the percentage of stitched reads for a specific ROI.

Parameters:
  • adata (AnnData) – The AnnData object containing expression and metadata.

  • idx (str) – The key of the ROI in the unstructured data (uns) of the AnnData object.

Returns:

The percentage of reads that were stitched.

Return type:

float

calc_percent_trimmed(adata, idx)

Calculate the percentage of trimmed reads for a specific ROI.

Parameters:
  • adata (AnnData) – The AnnData object containing expression and metadata.

  • idx (str) – The key of the ROI in the unstructured data (uns) of the AnnData object.

Returns:

The percentage of reads that were trimmed.

Return type:

float
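The three percentage metrics above (aligned, stitched, trimmed) can be illustrated with a small sketch. It assumes each percentage is taken relative to the raw read count, which the source does not confirm; the dictionary keys and the percent_of_raw helper are hypothetical.

```python
def percent_of_raw(count, raw):
    """Return count as a percentage of raw reads (assumed denominator)."""
    if raw == 0:
        return 0.0  # no reads at all: report 0% rather than dividing by zero
    return 100.0 * count / raw


# Hypothetical read counts for one segment.
segment = {"Raw": 20_000, "Trimmed": 19_000, "Stitched": 18_000, "Aligned": 17_000}

# Documented default thresholds from the constructor.
thresholds = {"Trimmed": 80, "Stitched": 80, "Aligned": 75}

for name, threshold in thresholds.items():
    value = percent_of_raw(segment[name], segment["Raw"])
    status = "pass" if value >= threshold else "fail"
    print(f"Percent {name}: {value:.1f}% (threshold {threshold}%) -> {status}")
```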

calc_total_reads(adata, idx)

Calculate the total number of reads for a specific region of interest (ROI).

Parameters:
  • adata (AnnData) – The AnnData object containing expression and metadata.

  • idx (str) – The key of the ROI in the unstructured data (uns) of the AnnData object.

Returns:

The sum of all reads for the specified ROI.

Return type:

int

check_metric(metric_name, threshold, calc_function, unit='%')

Evaluates a specific quality control (QC) metric for each segment and identifies segments that fail.

This method applies a calculation function (calc_function) to compute a QC metric for each region of interest (ROI) in the dataset. It compares the computed values against a specified threshold to determine which segments pass or fail the QC check.

Parameters:
  • metric_name (str) – Name of the metric being checked (e.g., “Percent Trimmed”, “Total Reads”).

  • threshold (float or int) – Minimum acceptable value for the metric. Segments with values below this threshold fail the QC check.

  • calc_function (callable) – A function that calculates the metric for a given segment. It should take the adata object and a segment identifier (idx) as inputs and return the computed value.

  • unit (str, optional) – Unit of the metric for display purposes (default is “%”).

Returns:

A set of segment identifiers (ROIs) that pass the QC check.

Return type:

set

Raises:

Exception – If the calc_function encounters an error during computation.

Notes

  • This method iterates over all segment identifiers (roi_metadata) in the dataset.

  • Segments that fail the QC check are printed with a warning message, displaying the metric value and the threshold.

  • The progress and percentage of passing segments are displayed using _print_progress.
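The behaviour described in these notes can be sketched without AnnData. The function below is an illustration under stated assumptions, not the library's implementation: check_metric_sketch is a hypothetical name, a plain dict stands in for the adata object, and the message wording is invented.

```python
def check_metric_sketch(data, roi_ids, metric_name, threshold, calc_function, unit="%"):
    """Return the set of ROI identifiers whose metric meets the threshold."""
    passed = set()
    for idx in roi_ids:
        value = calc_function(data, idx)
        if value >= threshold:
            passed.add(idx)
        else:
            # Values below the threshold fail and are reported.
            print(f"Warning: {idx} failed {metric_name}: {value}{unit} < {threshold}{unit}")
    share = 100.0 * len(passed) / len(roi_ids) if roi_ids else 0.0
    print(f"{metric_name}: {len(passed)}/{len(roi_ids)} segments passed ({share:.1f}%)")
    return passed


def total_reads(data, idx):
    """Sum all counts for one ROI in the stand-in data structure."""
    return sum(data[idx])


# Hypothetical counts per ROI.
data = {"ROI_1": [600, 700], "ROI_2": [100, 200]}
passed = check_metric_sketch(data, list(data), "Total Reads", 1000, total_reads, unit=" reads")
```

Returning a set makes it straightforward to combine the results of several metrics later.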

Examples

Define a metric calculation function:

def calc_total_reads(adata, idx):
    # Sum all counts in the expression matrix for the selected ROI.
    return adata[idx].X.sum()

Check the “Total Reads” metric:

passed_segments = qc.check_metric(
    metric_name="Total Reads",
    threshold=1000,
    calc_function=calc_total_reads,
    unit="reads"
)
print("Segments passing QC:", passed_segments)

static filter_by_negativeProbes(adata, negative_probe_cutoff=1.1, save_negatives=False)

Filters genes based on their background ratios compared to negative probes.

Parameters:
  • adata (AnnData) – The AnnData object containing gene expression data.

  • negative_probe_cutoff (float, optional) – The threshold for filtering genes based on their background ratios. Genes with ratios below this threshold are removed (default is 1.1).

  • save_negatives (bool, optional) – If True, returns a second AnnData object containing only the negative probes (default is False).

Returns:

  • If save_negatives is False, returns the filtered AnnData object.

  • If save_negatives is True, returns a tuple: (filtered AnnData object, AnnData object of negative probes).

Return type:

AnnData or tuple of AnnData

Raises:

ValueError – If the adata object does not contain the required “SystematicName” column.
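The filtering idea can be sketched with plain Python data. The source does not define the background ratio, so this sketch assumes it is a gene's mean count divided by the mean negative-probe count. The function name, the dict-based input, and the probe name used as an example are illustrative only.

```python
from statistics import mean


def filter_by_background_ratio(counts, negative_names, cutoff=1.1):
    """Keep genes whose assumed background ratio is at least `cutoff`.

    counts maps a gene or probe name to its counts across segments.
    Returns (kept genes, negative probes).
    """
    negatives = {g: c for g, c in counts.items() if g in negative_names}
    if not negatives:
        raise ValueError("no negative probes found")
    background = mean(mean(c) for c in negatives.values())
    if background == 0:
        raise ValueError("negative probe background is zero")
    kept = {
        gene: c
        for gene, c in counts.items()
        if gene not in negative_names and mean(c) / background >= cutoff
    }
    return kept, negatives


counts = {
    "GeneA": [50, 70],        # mean 60, ratio 6.0
    "GeneB": [9, 11],         # mean 10, ratio 1.0
    "NegProbe-WTX": [8, 12],  # negative probe, mean 10
}
kept, negatives = filter_by_background_ratio(counts, {"NegProbe-WTX"}, cutoff=1.1)
print(sorted(kept))
```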

Examples

Filter genes by negative probes and save the negative probes:

filtered_adata, negatives = CountsQC.filter_by_negativeProbes(adata, negative_probe_cutoff=1.5, save_negatives=True)

Filter genes by negative probes without saving negatives:

filtered_adata = CountsQC.filter_by_negativeProbes(adata, negative_probe_cutoff=1.2)

plot_before_after_filtering()

Generate visualizations to compare data before and after QC filtering.

This method creates visualizations to show the differences in expression data before and after applying QC filters. The plots generated include:

  1. Scatter plots of total expression sums per sample (before and after filtering).

  2. Histograms of expression sum distributions (before and after filtering).

Parameters:

None

Returns:

The method generates and displays plots but does not return any data.

Return type:

None

Notes

  • This method uses the run_all_checks method to filter the data based on QC metrics.

  • The raw and filtered AnnData objects are compared to highlight the impact of QC filtering.

  • The expression sums are computed across all samples for visualization.
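The quantities being plotted can be sketched as follows. This is a simplified illustration with plain dicts in place of AnnData; expression_sums and the sample data are hypothetical, and the set of passing ROIs stands in for the result of the QC checks.

```python
def expression_sums(matrix):
    """Total expression per sample; matrix maps sample id -> list of gene counts."""
    return {sample: sum(counts) for sample, counts in matrix.items()}


raw = {"ROI_1": [10, 20, 30], "ROI_2": [1, 2, 3], "ROI_3": [5, 5, 5]}
passed = {"ROI_1", "ROI_3"}  # hypothetical outcome of the QC checks

# Keep only the samples that passed QC.
filtered = {sample: counts for sample, counts in raw.items() if sample in passed}

before = expression_sums(raw)
after = expression_sums(filtered)
print("before:", before)
print("after: ", after)
```

The before and after dictionaries hold the per-sample totals that a scatter plot or histogram would display.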

Examples

qc = CountsQC(adata=my_adata)
qc.plot_before_after_filtering()

plot_qc_results()

Generate visualizations for Quality Control (QC) metrics.

This method generates a series of plots to visualize QC metrics across all regions of interest (ROIs) in the dataset. The visualizations include:

  1. Bar plot showing the percentage of segments passing QC thresholds.

  2. Histograms for each metric with optional threshold overlays.

  3. A heatmap of QC failures across metrics.

  4. A scatter plot comparing “Percent Trimmed” and “Percent Stitched” reads.

Thresholds for each metric are defined in the class attributes (e.g., minSegmentReads, percentTrimmed). Metrics without defined thresholds are visualized without overlays.

Parameters:

None

Returns:

This method generates and displays the plots but does not return any data.

Return type:

None

Notes

  • The metrics visualized include:
    • Total Reads

    • Percent Trimmed

    • Percent Stitched

    • Percent Aligned

    • Percent Saturation

    • Min Nuclei

    • Min Area

  • Metrics without valid data or missing thresholds are handled gracefully.

  • If thresholds are defined, they are indicated on the plots as dashed lines.
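The data behind the pass-rate bar plot and the failure heatmap can be sketched as a small table of flags. This is an illustration only: failure_matrix, the metric values, and the table layout are assumptions, and the library's internal representation may differ.

```python
def failure_matrix(metrics, thresholds):
    """Return metric names and, per ROI, a row of 1 (fail) / 0 (pass) flags."""
    names = list(thresholds)
    rows = {
        roi: [int(values[name] < thresholds[name]) for name in names]
        for roi, values in metrics.items()
    }
    return names, rows


# Thresholds taken from the documented defaults; metric values are hypothetical.
thresholds = {"Total Reads": 1000, "Percent Trimmed": 80, "Percent Aligned": 75}
metrics = {
    "ROI_1": {"Total Reads": 5000, "Percent Trimmed": 95.0, "Percent Aligned": 90.0},
    "ROI_2": {"Total Reads": 800, "Percent Trimmed": 85.0, "Percent Aligned": 60.0},
}

names, rows = failure_matrix(metrics, thresholds)

# Percentage of ROIs passing each metric (the bar plot values).
pass_rates = {
    name: 100.0 * sum(1 - row[i] for row in rows.values()) / len(rows)
    for i, name in enumerate(names)
}
print(rows)
print(pass_rates)
```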

Examples

qc = CountsQC(adata=my_adata)
qc.plot_qc_results()

run_all_checks(return_negative_probes=False, negative_probe_cutoff=None)

Run all QC checks and return filtered AnnData objects.

Parameters:
  • return_negative_probes (bool, optional) – Whether to return an AnnData object containing only the negative probes (default is False).

  • negative_probe_cutoff (float, optional) – The cutoff for filtering genes based on their background ratios compared to negative probes. Defaults to None in the signature; the cutoff applied in that case is 1.1, which matches the constructor default.

Returns:

  • filtered_adata (AnnData) – AnnData object with only ROIs passing all QC checks.

  • negative_probes_adata (AnnData, optional) – AnnData object with only negative probes (if requested).
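One way to arrive at the ROIs that pass all checks is to intersect the per-metric sets returned by check_metric. The sketch below assumes that approach; rois_passing_all and the sample sets are hypothetical, and the library may combine results differently.

```python
def rois_passing_all(per_metric_passed):
    """Intersect per-metric sets of passing ROIs.

    per_metric_passed maps a metric name to the set of ROIs that passed it.
    """
    sets = list(per_metric_passed.values())
    if not sets:
        return set()  # no checks were run, so nothing is confirmed as passing
    return set.intersection(*sets)


per_metric_passed = {
    "Total Reads": {"ROI_1", "ROI_2", "ROI_3"},
    "Percent Trimmed": {"ROI_1", "ROI_3"},
    "Percent Aligned": {"ROI_1", "ROI_2"},
}
print(rois_passing_all(per_metric_passed))
```

An ROI is kept only if it appears in every set, so a single failed metric is enough to exclude it.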

write(filename: str = None, compression: Literal['gzip', 'lzf'] = None)

Write the AnnData object to disk.

Parameters:
  • filename (str) – The name and location of the file to write the AnnData object to.

  • compression (str, optional) – Compression strategy to use (‘gzip’ or ‘lzf’).

Raises:

ValueError – If the ‘filename’ is not provided or is invalid.