GeoMxProcessor (opennano.io)

The GeoMxProcessor class processes NanoString GeoMx data. It combines file validation, parsing, and data integration into a single interface, so a complete run can go through one object and its process() method.

Features:

  • Parse .dcc and .pkc files.

  • Extract metadata from GEO SOFT files.

  • Create AnnData objects for downstream analysis.

The functions read_dcc_files (for one file), read_all_dcc_files (for multiple files), read_pkc, parse_geo_soft_metadata_with_identifier, and create_single_anndata are also available for users who want to perform these operations separately.

Class Documentation

class opennano.io.GeoMxProcessor(dcc_files, pkc_file, metadata_file)

Bases: object

A class for processing NanoString GeoMx data.

Combines file validation, parsing, and data integration into a single interface.

__init__(dcc_files, pkc_file, metadata_file)

Initialize the GeoMxProcessor with paths to .dcc, .pkc, and GEO SOFT metadata files.

Parameters:
  • dcc_files (str) – Path to the directory containing .dcc files.

  • pkc_file (str) – Path to the .pkc file.

  • metadata_file (str) – Path to the GEO SOFT metadata file.

Raises:

ValueError – If the provided paths are invalid.

static create_single_anndata(dcc_data, pkc_data, sample_metadata_df)

Creates a single AnnData object with RTS_IDs as observations (obs) and ROIs (samples) as variables (var).

This function integrates parsed .dcc files, .pkc probe information, and GEO SOFT metadata to construct an AnnData object. The resulting AnnData object contains:

  • obs: Information about RTS_IDs (features).

  • var: Information about samples (ROIs).

  • X: Expression counts aligned by RTS_IDs and ROIs.

  • uns: Unstructured metadata for further analysis.

Parameters:
  • dcc_data (dict) – A dictionary where keys are file names of .dcc files, and values are parsed dictionaries. Each parsed dictionary should include a Code_Summary DataFrame containing RTS_ID and Count.

  • pkc_data (dict) – A dictionary where keys are gene names and values are lists of probe dictionaries. Each probe dictionary should include fields like RTS_ID, SystematicName, GeneID, and ProbeID.

  • sample_metadata_df (pandas.DataFrame) – A DataFrame containing sample-level metadata parsed from GEO SOFT files. Each row should represent a sample and must include the column specified by description_col.

  • description_col (str, optional) – The column name in sample_metadata_df that contains the descriptions matching .dcc filenames. Default is “!Sample_description”.

Returns:

An AnnData object where:

  • obs contains RTS_ID-level metadata.

  • var contains ROI-level (sample-level) metadata.

  • X contains expression counts.

  • uns contains unstructured metadata mapped to samples.

Return type:

ad.AnnData

Raises:
  • KeyError – If required columns like RTS_ID or Count are missing in dcc_data or pkc_data.

  • ValueError – If there are mismatches between .dcc data and metadata descriptions.
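The alignment of counts by RTS_ID and ROI described above can be sketched in plain Python. This is an illustration only, not the library's implementation: the helper name build_count_matrix is hypothetical, Code_Summary is represented as a list of (RTS_ID, Count) pairs instead of a DataFrame, and filling an RTS_ID that is absent from a file with 0 is an assumption.

```python
def build_count_matrix(dcc_data):
    """Align per-file counts into a matrix with RTS_IDs as rows and ROIs as columns.

    Hypothetical helper for illustration. `dcc_data` maps a .dcc file name to
    {"Code_Summary": [(rts_id, count), ...]}.
    """
    rois = sorted(dcc_data)
    # Union of all RTS_IDs seen in any file, in a stable order.
    rts_ids = sorted({rts for parsed in dcc_data.values()
                      for rts, _ in parsed["Code_Summary"]})
    row_index = {rts: i for i, rts in enumerate(rts_ids)}

    # Assumption: an RTS_ID missing from a file gets a count of 0.
    matrix = [[0] * len(rois) for _ in rts_ids]
    for col, roi in enumerate(rois):
        for rts, count in dcc_data[roi]["Code_Summary"]:
            matrix[row_index[rts]][col] = count
    return rts_ids, rois, matrix


dcc_data = {
    "sample1.dcc": {"Code_Summary": [("RTS001", 10), ("RTS002", 20)]},
    "sample2.dcc": {"Code_Summary": [("RTS002", 5), ("RTS003", 7)]},
}
rts_ids, rois, matrix = build_count_matrix(dcc_data)
print(rts_ids)  # row labels (obs)
print(rois)     # column labels (var)
print(matrix)   # counts (X)
```

Here rts_ids plays the role of obs, rois the role of var, and matrix the role of X.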

Examples

Example Data Structure:

import pandas as pd
from opennano.io import GeoMxProcessor

# Parsed .dcc content, keyed by file name
dcc_data = {
    "sample1.dcc": {
        "Code_Summary": pd.DataFrame({
            "RTS_ID": ["RTS001", "RTS002"],
            "Count": [10, 20],
        })
    }
}
# Probe information, keyed by gene name
pkc_data = {
    "Gene1": [
        {"RTS_ID": "RTS001", "ProbeID": "P001", "GeneID": "G1"},
        {"RTS_ID": "RTS002", "ProbeID": "P002", "GeneID": "G1"},
    ],
    "Gene2": [{"RTS_ID": "RTS003", "ProbeID": "P003", "GeneID": "G2"}],
}
# Sample-level metadata parsed from a GEO SOFT file
sample_metadata_df = pd.DataFrame({
    "!Sample_description": ["DSP-123-A-S1", "DSP-123-A-S2"],
    "!Sample_title": ["Sample 1", "Sample 2"],
})
adata = GeoMxProcessor.create_single_anndata(dcc_data, pkc_data, sample_metadata_df)
print(adata)

Expected Output:

AnnData object with n_obs × n_vars = 3 × 2

parse_dcc_files()

Parses all .dcc files in the specified directory and stores their content.

This method validates that the self.dcc_files attribute points to a valid directory, then uses the read_all_dcc_files method to parse all .dcc files in the directory. The parsed content is stored in the self.dcc_data attribute for further processing.

Parameters:

None – This method operates on the self.dcc_files attribute, which should contain the path to the directory with .dcc files.

Returns:

The parsed .dcc file data is stored in the self.dcc_data attribute as a dictionary.

Return type:

None

Raises:

ValueError – If the self.dcc_files attribute is not a valid directory.

Notes

This method relies on read_all_dcc_files to perform the parsing of .dcc files.

The parsed data is structured as a dictionary, with .dcc file names as keys and the parsed content as values.
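The validate-then-collect pattern described in these notes can be sketched with the standard library. The function name collect_dcc_files and the parse_one callback are hypothetical stand-ins introduced for this illustration; they are not part of opennano, and the real method may differ in details such as how file names are used as keys.

```python
import tempfile
from pathlib import Path


def collect_dcc_files(dcc_dir, parse_one):
    """Return {file name: parsed content} for every .dcc file in dcc_dir.

    Hypothetical helper: `parse_one` stands in for whatever single-file
    parser is used.
    """
    directory = Path(dcc_dir)
    if not directory.is_dir():
        # Mirrors the documented behaviour: an invalid directory is a ValueError.
        raise ValueError(f"Not a valid directory: {dcc_dir}")
    return {path.name: parse_one(path) for path in sorted(directory.glob("*.dcc"))}


# Demonstration on a temporary directory; the stand-in parser just reads the text.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "sample1.dcc").write_text("<Header>\n</Header>\n")
    (Path(tmp) / "notes.txt").write_text("not a dcc file")
    parsed = collect_dcc_files(tmp, lambda path: path.read_text())

print(list(parsed))
```

Only the file with the .dcc extension ends up in the resulting dictionary.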

static parse_geo_soft_metadata_with_identifier(file_path)

Parses a GEO SOFT metadata file to extract series and sample metadata.

Parameters:

file_path (str) – Path to the GEO SOFT format metadata file.

Returns:

  • series_metadata (dict): General series-level metadata.

  • sample_metadata_df (pandas.DataFrame): Sample-specific metadata with structured characteristics.

Return type:

tuple

Raises:
  • FileNotFoundError – If the metadata file is not found.

  • ValueError – If the file is not in the correct GEO SOFT format.
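To make the split between series-level and sample-level metadata concrete, here is a minimal sketch of a SOFT parser using only the standard library. It is not the library's parser: the name parse_soft_text is hypothetical, samples are returned as a list of dictionaries instead of a DataFrame, and repeated keys within one sample simply overwrite each other, which a real parser would need to handle.

```python
def parse_soft_text(text):
    """Split GEO SOFT text into a series dictionary and a list of sample dictionaries.

    Hypothetical helper for illustration only.
    """
    series, samples, current = {}, [], None
    for raw in text.splitlines():
        line = raw.strip()
        if " = " not in line:
            continue  # skip blank or malformed lines
        key, value = line.split(" = ", 1)
        if key == "^SAMPLE":
            # A new sample block starts; later !Sample_ lines belong to it.
            current = {"Sample_ID": value}
            samples.append(current)
        elif key.startswith("!Sample_") and current is not None:
            current[key] = value
        elif key == "^SERIES" or key.startswith("!Series_"):
            series[key] = value
    return series, samples


soft_text = """\
^SERIES = GSE12345
!Series_title = Example Series
^SAMPLE = GSM123456
!Sample_title = Sample 1
!Sample_description = DSP-123-A-S1
"""
series_metadata, sample_records = parse_soft_text(soft_text)
print(series_metadata)
print(sample_records)
```

Each dictionary in sample_records corresponds to one row of the sample-level DataFrame.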

Examples

Example GEO SOFT Metadata File:

^SERIES = GSE12345
!Series_title = Example Series
^SAMPLE = GSM123456
!Sample_title = Sample 1
!Sample_description = DSP-123-A-S1

Parsing the File:

series_metadata, sample_metadata = GeoMxProcessor.parse_geo_soft_metadata_with_identifier("metadata.txt")
print(series_metadata)
# Output:
# {'^SERIES': 'GSE12345', '!Series_title': 'Example Series'}

print(sample_metadata)
# Output:
#    Sample_ID  !Sample_title  !Sample_description
# 0  GSM123456       Sample 1         DSP-123-A-S1

parse_metadata()

Parses the GEO SOFT metadata file and stores its content.

This method uses the parse_geo_soft_metadata_with_identifier function to process the GEO SOFT metadata file specified in the self.metadata_file attribute. The parsed content is stored in the self.series_metadata and self.metadata attributes.

Parameters:

None – This method operates on the self.metadata_file attribute, which should contain the path to the GEO SOFT metadata file.

Returns:

The parsed metadata is stored in the self.series_metadata (series-level metadata) and self.metadata (sample-level metadata as a pandas DataFrame) attributes.

Return type:

None

Raises:
  • FileNotFoundError – If the specified metadata file does not exist.

  • ValueError – If the metadata file is not in the correct GEO SOFT format.

Notes

  • The series-level metadata (e.g., general information about the series) is stored as a dictionary in self.series_metadata.

  • The sample-level metadata (e.g., details about individual samples) is stored as a pandas DataFrame in self.metadata.

  • This method is part of the high-level data integration workflow and assumes that the metadata file exists and is formatted correctly.

parse_pkc_file()

Parse the .pkc file and store its content using read_pkc.

process()

High-level method to process GeoMx data and create an AnnData object.

This method serves as a single entry point for parsing and integrating GeoMx data. It orchestrates the following steps:

  1. Parses .dcc files to extract code summary and metadata.

  2. Parses .pkc files to retrieve probe information.

  3. Parses GEO SOFT metadata files for series and sample metadata.

  4. Combines all parsed data into a unified AnnData object.
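The order of the four steps can be sketched with a small stand-in class. Everything here is hypothetical: PipelineSketch and its injected callables are placeholders that only show the sequencing, and they do not reproduce how GeoMxProcessor stores or validates its intermediate results.

```python
class PipelineSketch:
    """Hypothetical stand-in that runs three parsers and then a combiner."""

    def __init__(self, parse_dcc, parse_pkc, parse_metadata, combine):
        self._parse_dcc = parse_dcc
        self._parse_pkc = parse_pkc
        self._parse_metadata = parse_metadata
        self._combine = combine

    def process(self):
        dcc_data = self._parse_dcc()                         # step 1
        pkc_data = self._parse_pkc()                         # step 2
        _series_meta, sample_meta = self._parse_metadata()   # step 3
        return self._combine(dcc_data, pkc_data, sample_meta)  # step 4


calls = []


def stub(name, result):
    """Build a zero-argument parser stub that records when it is called."""
    def run():
        calls.append(name)
        return result
    return run


def combine(dcc_data, pkc_data, sample_meta):
    calls.append("combine")
    return {"dcc": dcc_data, "pkc": pkc_data, "samples": sample_meta}


pipeline = PipelineSketch(
    stub("dcc", {"sample1.dcc": {}}),
    stub("pkc", {"Gene1": []}),
    stub("metadata", ({"^SERIES": "GSE12345"}, [{"Sample_ID": "GSM123456"}])),
    combine,
)
result = pipeline.process()
print(calls)
```

The recorded call order matches the numbered steps: the three parsers run first and the combination step runs last.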

Returns:

A unified AnnData object containing:

  • obs: Metadata about RTS_IDs (features).

  • var: Metadata about ROIs (regions of interest or samples).

  • X: Expression counts aligned by RTS_IDs and ROIs.

  • uns: Unstructured metadata for further analysis.

Return type:

AnnData

Raises:

ValueError – If any of the required files (dcc_files, pkc_file, or metadata_file) are invalid or not properly formatted.

Notes

This method leverages three internal parsing methods:

  • parse_dcc_files: Parses and validates .dcc files.

  • parse_pkc_file: Parses and validates .pkc files.

  • parse_metadata: Extracts metadata from GEO SOFT files.

The final AnnData object is created using GeoMxProcessor.create_single_anndata, which integrates all parsed data into a single, consistent format for downstream analysis.

Examples

Process GeoMx data and generate an AnnData object:

processor = GeoMxProcessor(
    dcc_files="path/to/dcc_directory",
    pkc_file="path/to/probes.pkc",
    metadata_file="path/to/metadata.txt"
)
adata = processor.process()
print(adata)
# Output: AnnData object with n_obs × n_vars = 100 × 10

static read_dcc_files(file_path)

Parses .dcc files into structured data.

Parameters:

file_path (str) – Path to the directory containing .dcc files to parse.

Returns:

Dictionary of dictionaries for sections of the .dcc file with keys:

  • "Header": Metadata from the header section.

  • "Scan_Attributes": Scan-specific metadata.

  • "NGS_Processing_Attributes": Processing-specific metadata.

  • "Code_Summary": DataFrame with RTS_IDs and counts.

Return type:

dict

Raises:
  • FileNotFoundError – If the specified .dcc files path is not found.

  • ValueError – If the files are not correctly formatted.
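The section-based layout of a .dcc file can be parsed with a short state machine. The sketch below is an assumption-laden illustration, not the library's code: parse_dcc_text is a hypothetical name, it works on a string instead of a file path, and Code_Summary is returned as a list of (RTS_ID, Count) tuples instead of a DataFrame.

```python
def parse_dcc_text(text):
    """Parse <Section> ... </Section> blocks of comma-separated lines.

    Hypothetical helper for illustration only.
    """
    sections, current = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("</") and line.endswith(">"):
            current = None  # closing tag ends the current section
        elif line.startswith("<") and line.endswith(">"):
            current = line[1:-1]
            # Code_Summary holds rows of counts; other sections hold key/value pairs.
            sections[current] = [] if current == "Code_Summary" else {}
        elif current is not None:
            key, _, value = line.partition(",")
            if current == "Code_Summary":
                sections[current].append((key, int(value)))
            else:
                sections[current][key] = value
    return sections


dcc_text = """\
<Header>
key1,value1
key2,value2
</Header>
<Code_Summary>
RTS001,10
RTS002,20
</Code_Summary>
"""
parsed_dcc = parse_dcc_text(dcc_text)
print(parsed_dcc["Header"])
print(parsed_dcc["Code_Summary"])
```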

Examples

Example .dcc File Content:

<Header>
key1,value1
key2,value2
</Header>
<Code_Summary>
RTS001,10
RTS002,20
</Code_Summary>

Parsing the File:

dcc_data = GeoMxProcessor.read_dcc_files("path_to_DCC/")
print(dcc_data["DCC_1"])
# Output:
# {'Header': 'values1', 'Scan_Attributes': 'values2',
#  'NGS_Processing_Attributes' : 'values3', 'Code_Summary' : 'values4'}

print(dcc_data["DCC_1"]["Code_Summary"])
# Output:
#    RTS_ID  Count
# 0  RTS001     10
# 1  RTS002     20

static read_pkc(file_path)

Parses a .pkc file into a dictionary of probe information.

Parameters:

file_path (str) – Path to the .pkc file to parse.

Returns:

Dictionary where keys are gene names, and values are lists of probe dictionaries.

Return type:

dict

Raises:
  • FileNotFoundError – If the .pkc file does not exist.

  • JSONDecodeError – If the .pkc file is not valid JSON.
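Because a .pkc file is JSON, the gene-to-probes mapping can be sketched with the standard json module. The function name group_probes_by_gene is hypothetical, and the sketch assumes the "Targets", "DisplayName", and "Probes" keys shown in the example file content; real .pkc files carry additional fields that are ignored here.

```python
import json


def group_probes_by_gene(pkc_text):
    """Map each target's DisplayName to its list of probe dictionaries.

    Hypothetical helper; json.loads raises json.JSONDecodeError on invalid JSON.
    """
    document = json.loads(pkc_text)
    probes_by_gene = {}
    for target in document.get("Targets", []):
        # setdefault + extend merges probes if a gene name appears more than once.
        probes_by_gene.setdefault(target["DisplayName"], []).extend(
            target.get("Probes", [])
        )
    return probes_by_gene


pkc_text = """
{
    "Targets": [
        {"DisplayName": "Gene1", "Probes": [{"RTS_ID": "RTS001", "ProbeID": "P001"}]},
        {"DisplayName": "Gene2", "Probes": [{"RTS_ID": "RTS002", "ProbeID": "P002"}]}
    ]
}
"""
probes = group_probes_by_gene(pkc_text)
print(probes["Gene1"])
```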

Examples

Example .pkc File Content:

{
    "Targets": [
        {"DisplayName": "Gene1", "Probes": [{"RTS_ID": "RTS001", "ProbeID": "P001"}]},
        {"DisplayName": "Gene2", "Probes": [{"RTS_ID": "RTS002", "ProbeID": "P002"}]}
    ]
}

Parsing the File:

pkc_data = GeoMxProcessor.read_pkc("probes.pkc")
print(pkc_data["Gene1"])
# Output:
# [{'RTS_ID': 'RTS001', 'ProbeID': 'P001'}]