API Documentation

Tools

Preprocessing

singlecell_cookbook.tools.preprocess.subset_obs(adata, subset)[source]

Subset observations (rows) in an AnnData object.

This function modifies the AnnData object in-place by selecting a subset of observations based on the provided subset parameter. The subsetting can be done using observation names, integer indices, a boolean mask, or a pandas Index.

Parameters:
  • adata (AnnData) – The annotated data matrix to subset. Will be modified in-place.

  • subset (pd.Index | Sequence[str | int | bool]) – The subset specification. Can be one of:

      • A pandas Index containing observation names
      • A sequence of observation names (strings)
      • A sequence of integer indices
      • A boolean mask of length adata.n_obs

Return type:

None

Examples

>>> # Create an example AnnData object
>>> import anndata
>>> import pandas as pd
>>> import numpy as np
>>> from singlecell_cookbook.tools.preprocess import subset_obs
>>> adata = anndata.AnnData(
...     X=np.array([[1, 2], [3, 4], [5, 6]]),
...     obs=pd.DataFrame(index=['A', 'B', 'C']),
...     var=pd.DataFrame(index=['gene1', 'gene2'])
... )
>>> # subset_obs works in-place, so each example starts from a fresh copy
>>> # Subset using pandas Index
>>> a = adata.copy()
>>> subset_obs(a, pd.Index(['B', 'C']))
>>> a.obs_names.tolist()
['B', 'C']
>>> # Subset using observation names
>>> a = adata.copy()
>>> subset_obs(a, ['A', 'B'])
>>> a.obs_names.tolist()
['A', 'B']
>>> # Subset using integer indices
>>> a = adata.copy()
>>> subset_obs(a, [0, 1])
>>> a.obs_names.tolist()
['A', 'B']
>>> # Subset using boolean mask
>>> a = adata.copy()
>>> subset_obs(a, [True, False, True])
>>> a.obs_names.tolist()
['A', 'C']

Notes

  • The function modifies the AnnData object in-place

  • When using a boolean mask, its length must match the number of observations

  • When using integer indices, they must be valid indices for the observations

  • Invalid observation names or indices will raise KeyError or IndexError respectively

  • The order of observations in the output will match the order in the subset parameter
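
The notes above can be illustrated with a small, library-independent sketch of how a subset specification might be resolved to integer positions. The helper name resolve_subset is hypothetical and is not part of singlecell_cookbook; raising IndexError for a mask of the wrong length is also an assumption, since the notes only state that the length must match.

```python
from typing import Sequence, Union


def resolve_subset(names: Sequence[str],
                   subset: Sequence[Union[str, int, bool]]) -> list[int]:
    """Hypothetical helper: turn a subset specification into integer positions."""
    subset = list(subset)
    n = len(names)
    if not subset:
        return []
    # bool must be checked before int, because bool is a subclass of int.
    if all(isinstance(s, bool) for s in subset):
        if len(subset) != n:
            # Assumption: the error type for a bad mask length is not documented.
            raise IndexError(f"boolean mask has length {len(subset)}, expected {n}")
        return [i for i, keep in enumerate(subset) if keep]
    if all(isinstance(s, int) for s in subset):
        for i in subset:
            if not -n <= i < n:
                raise IndexError(f"index {i} is out of bounds for {n} observations")
        return [i % n for i in subset]
    # Otherwise treat the entries as observation names.
    position = {name: i for i, name in enumerate(names)}
    missing = [s for s in subset if s not in position]
    if missing:
        raise KeyError(f"unknown observation names: {missing}")
    return [position[s] for s in subset]


names = ["A", "B", "C"]
by_name = resolve_subset(names, ["C", "A"])            # order follows the subset
by_index = resolve_subset(names, [0, 1])
by_mask = resolve_subset(names, [True, False, True])   # order follows the data
print(by_name, by_index, by_mask)
```

Note that a boolean mask keeps the original order of the observations, whereas names and integer indices follow the order given in the subset.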

singlecell_cookbook.tools.preprocess.subset_var(adata, subset)[source]

Subset variables (columns) in an AnnData object.

This function modifies the AnnData object in-place by selecting a subset of variables based on the provided subset parameter. The subsetting can be done using variable names, integer indices, a boolean mask, or a pandas Index.

Parameters:
  • adata (AnnData) – The annotated data matrix to subset. Will be modified in-place.

  • subset (pd.Index | Sequence[str | int | bool]) – The subset specification. Can be one of:

      • A pandas Index containing variable names
      • A sequence of variable names (strings)
      • A sequence of integer indices
      • A boolean mask of length adata.n_vars

Return type:

None

Examples

>>> # Create an example AnnData object
>>> import anndata
>>> import pandas as pd
>>> import numpy as np
>>> from singlecell_cookbook.tools.preprocess import subset_var
>>> adata = anndata.AnnData(
...     X=np.array([[1, 2, 3], [4, 5, 6]]),
...     obs=pd.DataFrame(index=['cell1', 'cell2']),
...     var=pd.DataFrame(index=['gene1', 'gene2', 'gene3'])
... )
>>> # subset_var works in-place, so each example starts from a fresh copy
>>> # Subset using pandas Index
>>> a = adata.copy()
>>> subset_var(a, pd.Index(['gene2', 'gene3']))
>>> a.var_names.tolist()
['gene2', 'gene3']
>>> # Subset using variable names
>>> a = adata.copy()
>>> subset_var(a, ['gene1', 'gene2'])
>>> a.var_names.tolist()
['gene1', 'gene2']
>>> # Subset using integer indices
>>> a = adata.copy()
>>> subset_var(a, [0, 1])
>>> a.var_names.tolist()
['gene1', 'gene2']
>>> # Subset using boolean mask
>>> a = adata.copy()
>>> subset_var(a, [True, False, True])
>>> a.var_names.tolist()
['gene1', 'gene3']

Notes

  • The function modifies the AnnData object in-place

  • When using a boolean mask, its length must match the number of variables

  • When using integer indices, they must be valid indices for the variables

  • Invalid variable names or indices will raise KeyError or IndexError respectively

  • The order of variables in the output will match the order in the subset parameter

singlecell_cookbook.tools.preprocess.pool_neighbors(adata, *, layer=None, n_neighbors=None, neighbors_key=None, weighted=False, directed=True, key_added=None, copy=False)[source]

Given an adjacency matrix, pool cell features using a weighted sum of feature counts from neighboring cells. The weights are the normalized connectivities from the adjacency matrix.

Parameters:
  • adata (AnnData) – Annotated data matrix.

  • layer (str, optional) – Layer in AnnData object to use for pooling. Defaults to None.

  • n_neighbors (int, optional) – Number of neighbors to consider. Defaults to None.

  • neighbors_key (str, optional) – Key in AnnData object to use for neighbors. Defaults to None.

  • weighted (bool, optional) – Whether to weight neighbors by their connectivities in the adjacency matrix. Defaults to False.

  • directed (bool, optional) – Whether to use directed or undirected neighbors. Defaults to True.

  • key_added (str, optional) – Key to use in AnnData object for the pooled features. Defaults to None.

  • copy (bool, optional) – Whether to return a copy of the pooled features instead of modifying the original AnnData object. Defaults to False.

Returns:

The pooled features if copy is True, otherwise None.

Return type:

csr_matrix | ndarray | None
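
The pooling idea described above (a weighted sum over neighbors, with weights taken from row-normalized connectivities) can be sketched with plain NumPy. This is an illustration under stated assumptions, not the implementation of pool_neighbors: the function name pool_features is hypothetical, dense arrays are used for brevity, and the handling of n_neighbors, directed and self-connections in the real function is not shown here.

```python
import numpy as np


def pool_features(X: np.ndarray, connectivities: np.ndarray,
                  weighted: bool = False) -> np.ndarray:
    """Illustrative pooling: each cell becomes a weighted sum of its neighbors' features."""
    A = np.asarray(connectivities, dtype=float)
    if not weighted:
        # Assumption: unweighted pooling gives every neighbor the same weight.
        A = (A > 0).astype(float)
    # Normalize each row so that the weights for one cell sum to 1.
    row_sums = A.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0  # cells without neighbors keep an all-zero row
    W = A / row_sums
    return W @ X


# Toy data: 3 cells x 2 features, and a 3 x 3 connectivity matrix.
X = np.array([[2.0, 0.0],
              [0.0, 4.0],
              [6.0, 2.0]])
conn = np.array([[0.0, 1.0, 0.0],
                 [0.5, 0.0, 0.5],
                 [0.0, 0.25, 0.75]])

pooled_weighted = pool_features(X, conn, weighted=True)
pooled_unweighted = pool_features(X, conn, weighted=False)
print(pooled_weighted)
print(pooled_unweighted)
```

With weighted=True the third cell is dominated by its strongest connection; with weighted=False its two neighbors contribute equally.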

Analysis

singlecell_cookbook.tools.analysis.go_enrichment(ontology, genes_universe, genes_list, verbosity=2)[source]

Runs a GO enrichment analysis.

Parameters:
  • ontology (ParsedOntology) – Parsed Gene Ontology object.

  • genes_universe (list[str]) – List of all genes in the universe.

  • genes_list (list[str]) – List of genes to test for enrichment.

  • verbosity (int, optional) – How much output to display. Defaults to 2.

Returns:

A DataFrame with the results of the enrichment analysis. The index is the GO term identifier and the columns are “hits”, “background”, “pvalue”, “bonferroni”, and “benjamini”. The “hits” and “background” columns are integer values and the others are float values.

Return type:

pandas.DataFrame
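
The documentation does not state which statistical test go_enrichment uses. As a hedged sketch of how columns such as "pvalue", "bonferroni" and "benjamini" are commonly computed, the following assumes a one-sided hypergeometric test followed by Bonferroni and Benjamini-Hochberg corrections. All function names are hypothetical and the numbers are toy values.

```python
from math import comb


def hypergeom_sf(hits: int, universe: int, annotated: int, drawn: int) -> float:
    """P(X >= hits) when `drawn` genes are sampled from `universe`,
    of which `annotated` carry the GO term."""
    upper = min(annotated, drawn)
    favourable = sum(
        comb(annotated, k) * comb(universe - annotated, drawn - k)
        for k in range(hits, upper + 1)
    )
    return favourable / comb(universe, drawn)


def bonferroni(pvalues: list[float]) -> list[float]:
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues: list[float]) -> list[float]:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for offset, i in enumerate(order):
        rank = m - offset  # 1-based rank in ascending order
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy case: 20 genes in the universe, 5 annotated with a term,
# 4 genes in the test list, 3 of them annotated.
p = hypergeom_sf(hits=3, universe=20, annotated=5, drawn=4)  # 155 / 4845

raw = [0.01, 0.04, 0.03]
bonf = bonferroni(raw)
bh = benjamini_hochberg(raw)
print(p, bonf, bh)
```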

singlecell_cookbook.tools.analysis.parse_obo(obo_filepath, gene2go_associations)[source]

Parses an OBO file and gene2go associations dataframe into a graph.

The graph is represented as a dictionary where each key is a node ID and the value is a set of node IDs that are connected to the node.

Parameters:
  • obo_filepath (str | Path) – Path to the OBO file.

  • gene2go_associations (pd.DataFrame) – Dataframe containing the gene2go associations. It must contain at least columns gene_id and go_term_id.

Returns:

NamedTuple containing the nodes list, the GO terms dataframe, the graph and the gene2go associations dataframe.

Return type:

ParsedOntology
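
To make the dictionary-of-sets graph representation concrete, here is a minimal sketch that collects is_a links from the [Term] stanzas of OBO-formatted text. It is not the parser used by parse_obo: the function name build_graph is hypothetical, the edge direction (term to its parents) is an assumption, and the term IDs are toy values.

```python
def build_graph(obo_text: str) -> dict[str, set[str]]:
    """Collect is_a links from [Term] stanzas into a dict of sets.

    Assumption: each key maps a term ID to the IDs of its direct parents.
    """
    graph: dict[str, set[str]] = {}
    current = None
    in_term = False
    for raw in obo_text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza starts; only [Term] stanzas are of interest.
            in_term = line == "[Term]"
            current = None
        elif in_term and line.startswith("id:"):
            current = line.split(":", 1)[1].strip()
            graph.setdefault(current, set())
        elif in_term and current and line.startswith("is_a:"):
            # "is_a: TOY:0001 ! root" -> "TOY:0001"
            parent = line.split(":", 1)[1].split("!")[0].strip()
            graph[current].add(parent)
            graph.setdefault(parent, set())
    return graph


OBO_TEXT = """\
[Term]
id: TOY:0001
name: root

[Term]
id: TOY:0002
name: child
is_a: TOY:0001 ! root

[Typedef]
id: part_of
"""

graph = build_graph(OBO_TEXT)
print(graph)
```

The [Typedef] stanza is skipped, so only term IDs appear as nodes.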

class singlecell_cookbook.tools.analysis.ParsedOntology(nodes_list, goterms, graph, gene2go)[source]

Bases: NamedTuple

Parameters:
  • nodes_list (list[dict])

  • goterms (DataFrame)

  • graph (dict[str, set[str]])

  • gene2go (DataFrame)

nodes_list: list[dict]

Alias for field number 0

goterms: DataFrame

Alias for field number 1

graph: dict[str, set[str]]

Alias for field number 2

gene2go: DataFrame

Alias for field number 3

singlecell_cookbook.tools.analysis.edger_pseudobulk(adata_, group_key, condition_group=None, reference_group=None, cell_identity_key=None, layer=None, replicas_per_group=10, min_cells_per_group=30, bootstrap_sampling=True, use_cells=None, aggregate=True, verbosity=0)[source]

Fits a model using edgeR and computes top tags for a given condition vs reference group.

Parameters:
  • adata_ (AnnData) – Annotated data matrix.

  • group_key (str) – Key in AnnData object to use to group cells.

  • condition_group (str | list[str] | None, optional) – Condition group to compare to reference group. If None, each group will be contrasted to the corresponding reference group.

  • reference_group (str | None, optional) – Reference group to compare condition group(s) to. If None, the condition group is compared to the rest of the cells.

  • cell_identity_key (str | None, optional) – If provided, separate contrasts will be computed for each identity. Defaults to None.

  • layer (str | None, optional) – Layer in AnnData object to use. EdgeR requires raw counts. Defaults to None.

  • replicas_per_group (int, optional) – Number of replicas to create for each group. Defaults to 10.

  • min_cells_per_group (int, optional) – Minimum number of cells required for a group to be included. Defaults to 30.

  • bootstrap_sampling (bool, optional) – Whether to use bootstrap sampling to create replicas. Defaults to True.

  • use_cells (dict[str, list[str]] | None, optional) – If not None, only the specified cells are used. Each dictionary key is a categorical variable in the obs dataframe and each value is the list of categories to include. Defaults to None.

  • aggregate (bool, optional) – Whether to aggregate cells before fitting the model. EdgeR requires a small number of samples, so if adata_ is a single-cell experiment, the cells should be aggregated. Defaults to True.

  • verbosity (int, optional) – Verbosity level. Defaults to 0.

Returns:

Dictionary of dataframes, one for each contrast, with the following columns:

  • gene_ids (str) – Gene IDs.

  • logFC (float) – Log2 fold change.

  • logCPM (float) – Log2 CPM.

  • F (float) – F-statistic.

  • PValue (float) – p-value.

  • FDR (float) – False discovery rate.

  • pct_expr_cnd (float) – Percentage of cells in condition group expressing the gene.

  • pct_expr_ref (float) – Percentage of cells in reference group expressing the gene.

Return type:

dict[str, pd.DataFrame]
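
How edger_pseudobulk builds its replicas is not specified above. The following sketch shows one common way to create bootstrap pseudobulk replicas for a single group, assuming that each replica resamples the group's cells with replacement and sums their raw counts. The function name bootstrap_pseudobulk and its arguments are hypothetical and only mirror replicas_per_group, min_cells_per_group and bootstrap_sampling in spirit.

```python
import numpy as np


def bootstrap_pseudobulk(counts: np.ndarray, n_replicas: int = 10,
                         min_cells: int = 30, seed: int = 0) -> np.ndarray:
    """Sum raw counts over bootstrap resamples of one group's cells.

    counts has shape (n_cells, n_genes); the result has shape (n_replicas, n_genes).
    """
    counts = np.asarray(counts)
    n_cells = counts.shape[0]
    if n_cells < min_cells:
        raise ValueError(f"group has {n_cells} cells, fewer than {min_cells}")
    rng = np.random.default_rng(seed)
    replicas = []
    for _ in range(n_replicas):
        # Sample cell positions with replacement, then aggregate their counts.
        picked = rng.integers(0, n_cells, size=n_cells)
        replicas.append(counts[picked].sum(axis=0))
    return np.stack(replicas)


# Toy raw counts for one group: 40 cells x 3 genes.
toy_counts = np.random.default_rng(1).poisson(2.0, size=(40, 3))
replicas = bootstrap_pseudobulk(toy_counts, n_replicas=4)
print(replicas.shape)
```

The resulting replica-by-gene matrix is the kind of small-sample count table that a bulk method such as edgeR expects.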