
The extract command processes genetic barcoding data from raw sequencing files or BAM files generated by a wide array of technologies. It requires two input files:

  • YAML Configuration File: Specifies the parameters for the extraction process.
  • CSV Inputs File: Defines the file paths for the sequencing data.

Outputs

  • Count matrices for each sample, saved separately.
  • A final merged count matrix combining all samples, saved in AnnData and CSV formats.

Usage

In a terminal, run the command as follows:

quicat extract <path_to_yaml>

Input Files

YAML File Example

This file specifies all the running parameters for the extract pipeline. Depending on the parameters provided, QuiCAT selects the reference-free or the reference-based workflow, decides whether to enable sequencing error correction, and sets the error tolerance for the extraction or alignment phase of the selected workflow.

The figure below gives a high-level overview of the QuiCAT extract workflow and shows how the specified parameters guide the software’s decision-making during execution.

[Figure: Overview of the extract workflow]

config:
  sequencing_technology: (str) "10x"
  input: (str) "bam"
  paired_end: (bool) false
  barcodes_path: (bool) false
  phred33: (bool) true
  read_qc_threshold: (int) 20
  read_qc_percentage: (int) 80
  filter_barcodes_relative_abundance: (float) 0.001
  filter_barcodes_raw_numbers: (int) null
  reference: (str) "GCTACTTGAT*ATCCTACTTG"
  contig: (str) '*'
  flanked_pattern: (bool) true
  min_read_length: (int) 40
  max_read_length: (int) 40
  read_length: (int) 40
  aln_mismatches: (int) null
  flanking_mismatches: (float) null
  left_flanking_coverage: (int) null
  right_flanking_coverage: (int) null
  distance_threshold: (int) 8
  barcode_ratio: (int) 5
  n_threads: (int) -1

input_csv: (str) "/path/to/input.csv"
output_path: (str) "/path/to/output"
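
The parenthesised type hints in the example, such as (str) and (int), appear to be documentation annotations rather than YAML syntax. As a minimal sketch under that assumption, the Python snippet below renders a plain configuration file with a subset of the keys shown above. The helper name `render_config` is hypothetical and is not part of QuiCAT.

```python
import json


def render_config(config, input_csv, output_path):
    """Render a flat, two-level config as YAML text (hypothetical helper).

    JSON scalars (quoted strings, numbers, true/false/null) are also valid
    YAML scalars, so json.dumps is sufficient for these simple values.
    """
    lines = ["config:"]
    for key, value in config.items():
        lines.append(f"  {key}: {json.dumps(value)}")
    lines.append("")
    lines.append(f"input_csv: {json.dumps(input_csv)}")
    lines.append(f"output_path: {json.dumps(output_path)}")
    return "\n".join(lines) + "\n"


# Values copied from the example above; the type hints are dropped.
config = {
    "sequencing_technology": "10x",
    "input": "bam",
    "paired_end": False,
    "reference": "GCTACTTGAT*ATCCTACTTG",
    "contig": "*",
    "flanked_pattern": True,
    "read_length": 40,
    "aln_mismatches": None,
    "n_threads": -1,
}

text = render_config(config, "/path/to/input.csv", "/path/to/output")
print(text)
```

The resulting text can be saved to a file and passed to `quicat extract` as the YAML path.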

YAML Parameters: in-depth explanation

config section

general parameters

| Parameter | Description | Default Value |
| --- | --- | --- |
| sequencing_technology | Technology (e.g., "10x", "Parse", "DNA"). | "10x" |
| input | File format ("fastq" or "bam"). | "bam" |
| paired_end | Set true for paired-end reads (only for fastq). | false |
| threads | Number of threads to use; -1 uses all available threads. | -1 |

QC thresholds

| Parameter | Description | Default Value |
| --- | --- | --- |
| read_qc_threshold | Minimum quality score. | 20 |
| read_qc_percentage | Minimum percentage of bases that meet the quality score. | 80 |
| phred33 | Set true for Phred+33 or false for Phred+64. | true |
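
To make the interaction of the three QC parameters concrete, here is a minimal Python sketch of a per-read quality check. It assumes that a base "meets" the threshold when its score is greater than or equal to read_qc_threshold, and that a read passes when the share of such bases is at least read_qc_percentage; QuiCAT's exact comparisons may differ. The function name `passes_read_qc` is hypothetical.

```python
def passes_read_qc(quality, read_qc_threshold=20, read_qc_percentage=80, phred33=True):
    """Return True if enough bases in a FASTQ quality string meet the threshold.

    Assumption: both comparisons are inclusive (>=). This is an illustration,
    not QuiCAT's implementation.
    """
    if not quality:
        return False
    # Phred+33 and Phred+64 differ only in the ASCII offset of the encoding.
    offset = 33 if phred33 else 64
    scores = [ord(ch) - offset for ch in quality]
    good = sum(score >= read_qc_threshold for score in scores)
    return 100 * good / len(scores) >= read_qc_percentage


# 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
high_quality = passes_read_qc("IIIIIIII##")   # 8 of 10 bases pass
low_quality = passes_read_qc("IIIIIII###")    # 7 of 10 bases pass
# 'h' encodes Q40 under Phred+64.
phred64_read = passes_read_qc("hhhh", phred33=False)
```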

barcodes filters

| Parameter | Description | Default Value |
| --- | --- | --- |
| filter_barcodes_relative_abundance | In the final count matrix only retain barcodes with relative abundance above this value. | 0.001 |
| filter_barcodes_raw_numbers | In the final count matrix only retain barcodes with raw count above this value. | null |

Note

Relative abundances and raw counts are evaluated within each individual sample.
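
The two filters can be illustrated with a short Python sketch that operates on the barcode counts of a single sample. It assumes that "above" means strictly greater than the threshold and that, when both parameters are set, a barcode must pass both filters; neither detail is confirmed by this page. The function name `filter_barcodes` is hypothetical.

```python
def filter_barcodes(counts, relative_abundance=0.001, raw_numbers=None):
    """Filter one sample's barcode counts (hypothetical helper, not QuiCAT API).

    counts maps barcode sequence -> raw count within a single sample.
    A threshold set to None disables the corresponding filter.
    """
    total = sum(counts.values())
    kept = {}
    for barcode, count in counts.items():
        if relative_abundance is not None and total > 0:
            if count / total <= relative_abundance:
                continue  # relative abundance is not above the threshold
        if raw_numbers is not None and count <= raw_numbers:
            continue  # raw count is not above the threshold
        kept[barcode] = count
    return kept


sample = {"AAAA": 9990, "CCCC": 9, "GGGG": 1}
by_abundance = filter_barcodes(sample, relative_abundance=0.001)
by_raw_count = filter_barcodes(sample, relative_abundance=None, raw_numbers=5)
```

With these toy counts, the relative-abundance filter keeps only AAAA, while the raw-count filter keeps AAAA and CCCC.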

barcodes extraction

| Parameter | Description | Default Value |
| --- | --- | --- |
| reference | Sequence for alignment (string or path to a file). | |
| contig | Contig to extract from BAM files ('*' for unmapped reads). | '*' |
| flanked_pattern | Set true if the reference specifies a flanked pattern, e.g. seq1*seq2. | false |
| aln_mismatches | If specified in conjunction with a reference file, switches from Aho-Corasick matching to the optimal aligner. | 0 |
| flanking_mismatches | If specified in conjunction with a flanked reference, switches from regex to cutadapt. | 0.0 |
| left_flanking_coverage | Minimum base-pair overlap on the left flanking sequence. | 0 |
| right_flanking_coverage | Minimum base-pair overlap on the right flanking sequence. | 0 |
| min_read_length | Minimum barcode length to retain. | null |
| max_read_length | Maximum barcode length to retain. | null |
| read_length | Fixed barcode length to retain. | null |

Note

Use min_read_length and max_read_length in combination to specify a range.
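
As an illustration of a flanked reference and the length filters, the Python sketch below extracts the sequence between two exact flanks with a regular expression and then applies the fixed-length or range filter. It covers only the exact-match case (no flanking_mismatches) and assumes inclusive length bounds. The function names are hypothetical and the read is made up for the example.

```python
import re


def flanked_regex(reference):
    """Turn 'seq1*seq2' into a regex that captures the barcode between the flanks."""
    left, right = reference.split("*", 1)
    return re.compile(re.escape(left) + "(.+?)" + re.escape(right))


def length_ok(barcode, min_read_length=None, max_read_length=None, read_length=None):
    """Apply the fixed-length or range filter; None disables a bound."""
    n = len(barcode)
    if read_length is not None and n != read_length:
        return False
    if min_read_length is not None and n < min_read_length:
        return False
    if max_read_length is not None and n > max_read_length:
        return False
    return True


def extract_barcode(read, reference, **length_filters):
    """Return the barcode between the flanks, or None if absent or filtered out."""
    match = flanked_regex(reference).search(read)
    if match is None:
        return None
    barcode = match.group(1)
    return barcode if length_ok(barcode, **length_filters) else None


reference = "GCTACTTGAT*ATCCTACTTG"
# Made-up read: 2 bases, left flank, 10-base barcode, right flank, 2 bases.
read = "TT" + "GCTACTTGAT" + "ACGTACGTAC" + "ATCCTACTTG" + "AA"

in_range = extract_barcode(read, reference, min_read_length=8, max_read_length=12)
wrong_fixed_length = extract_barcode(read, reference, read_length=40)
```

Here `in_range` is the 10-base barcode, while `wrong_fixed_length` is `None` because the barcode is not 40 bases long.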

barcodes collapsing

| Parameter | Description | Default Value |
| --- | --- | --- |
| distance_threshold | Distance threshold for barcode collapse. If set to 0, barcodes are not collapsed. | null |
| barcodes_ratio | Minimum fold difference in counts for a barcode to be collapsed into a more abundant one. | null |
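
The following Python sketch shows one way the two collapsing parameters could interact. This page does not state which distance metric QuiCAT uses, so the sketch substitutes Hamming distance on equal-length barcodes as a stand-in, and it assumes a barcode is merged into the first more abundant barcode that is within the distance threshold and at least the given fold more abundant. The function names are hypothetical.

```python
def hamming(a, b):
    """Hamming distance, or None for barcodes of different length."""
    if len(a) != len(b):
        return None
    return sum(x != y for x, y in zip(a, b))


def collapse_barcodes(counts, distance_threshold, barcode_ratio):
    """Greedy collapse of low-count barcodes into abundant neighbours.

    Assumptions (not confirmed by the documentation): Hamming distance,
    inclusive thresholds, and fold computed on the original counts.
    """
    if not distance_threshold:
        return dict(counts)  # a threshold of 0 disables collapsing
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    collapsed = {}
    for barcode, count in ordered:
        parent = None
        for kept in collapsed:
            distance = hamming(barcode, kept)
            close = distance is not None and distance <= distance_threshold
            if close and counts[kept] >= barcode_ratio * count:
                parent = kept
                break
        if parent is None:
            collapsed[barcode] = count
        else:
            collapsed[parent] += count
    return collapsed


counts = {"ACGTACGT": 100, "ACGTACGA": 5, "TTTTCCCC": 50}
merged = collapse_barcodes(counts, distance_threshold=1, barcode_ratio=5)
untouched = collapse_barcodes(counts, distance_threshold=0, barcode_ratio=5)
```

In this toy example, ACGTACGA differs from ACGTACGT at one position and is 20 times less abundant, so its 5 counts are added to ACGTACGT.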

I/O

| Parameter | Description | Default Value |
| --- | --- | --- |
| input_csv | Path to the CSV file listing the input file paths. See below. | null |
| output_path | Path to the folder where the pipeline's output is stored. | null |

CSV File Example

Specify the input file paths and sample metadata in the input CSV file. The file must have the following columns:

sample,fastq_path_r1,fastq_path_r2,barcodes_path,bam_path,condition,replicate

Explanation:

| Column | Description | Notes |
| --- | --- | --- |
| sample | Sample name as added in the final count matrix. | |
| fastq_path_r1 | Path to the R1 fastq file, or to the only fastq file if not paired-end. | Leave blank if using bam |
| fastq_path_r2 | Path to the R2 fastq file. | Leave blank if not paired-end |
| barcodes_path | Path to barcodes.tsv.gz (single-cell or spatial data only). | Leave blank if no cell/spot whitelist is needed |
| bam_path | Path to the bam file. | Leave blank if using fastq |
| condition | Disease or treatment status (added to the final count matrix). | Optional |
| replicate | If the sample is a technical replicate, the general sample name (added to the final count matrix). | Optional |
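
A minimal Python sketch for generating the inputs file with the required columns is shown below. The sample names and file paths are placeholders invented for illustration, and the helper name `write_inputs_csv` is hypothetical; columns that do not apply to a sample are left blank, as described in the table.

```python
import csv
import io

# Column order as required by the inputs file.
COLUMNS = [
    "sample", "fastq_path_r1", "fastq_path_r2", "barcodes_path",
    "bam_path", "condition", "replicate",
]


def write_inputs_csv(handle, rows):
    """Write rows to an open text handle; missing columns are left blank."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS, restval="")
    writer.writeheader()
    writer.writerows(rows)


# Placeholder samples: two BAM inputs with cell barcode whitelists.
rows = [
    {
        "sample": "ctrl_1",
        "barcodes_path": "/data/ctrl_1/barcodes.tsv.gz",
        "bam_path": "/data/ctrl_1.bam",
        "condition": "control",
    },
    {
        "sample": "treated_1",
        "barcodes_path": "/data/treated_1/barcodes.tsv.gz",
        "bam_path": "/data/treated_1.bam",
        "condition": "treated",
    },
]

buffer = io.StringIO()
write_inputs_csv(buffer, rows)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead of an in-memory buffer, open the target with `open(path, "w", newline="")` and pass that handle to the same function.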