Curating public aptamer databases for downstream analysis

This report describes the integration of three curated aptamer databases into one clean, fully categorized dataset of aptamers across all target categories. It is the dataset behind the interactive neighbour explorer. Generated from aptamer_benchmark/.

At a glance: 3 source DBs; 3,373 curated records; 2,782 unique sequences; 2,329 sequence families; 2,710 with Kd.

1. Source databases

Three public, manually curated aptamer collections were combined. They differ in size, accessibility and licensing (the record counts below are each database's contribution to this dataset and sum to 3,373):

| Database | Records | Bulk access | License |
|---|---|---|---|
| UTexas Aptamer DB (Ellington lab) | 1,495 | Zenodo xlsx | CC BY 3.0 US |
| AptaDB (ECUST) | 1,350 | scraped (no API) | academic/unclear |
| APTABASE (IIT Guwahati) | 528 | scraped | academic/unclear |

UTexas is the most permissively licensed and richest (pre-normalized Kd, DNA/RNA type, buffer/pH); AptaDB carries UniProt accessions enabling principled target typing; APTABASE needed encoding clean-up. Provenance and a redistributable flag are kept per row.

2. Cleaning & integration pipeline

  1. Ingest all three into one long schema (one row per aptamer-affinity record) with provenance.
  2. Clean sequences: fix mojibake, strip 5'/3' markers, uppercase, U→T canonical key; derive DNA/RNA and a chemical-modification flag; recompute length & GC.
  3. Normalize affinity to a numeric Kd in nM (pM/µM/mM converted); flag approx/range; skip association constants (Ka).
  4. Categorize targets by rule: pathogens (virus/bacterium/toxin/parasite, host-gated) → whole cells → cell-surface proteins (AptaDB via UniProt subcellular location; others via surface markers) → drop the rest.
  5. Deduplicate: exact canonical key collapses identical aptamers across DBs (keeping all affinity records); containment clustering merges primer-trimmed / truncation variants into sequence families.
  6. Export the curated dataset + datasheet.
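
The sequence-cleaning step (step 2) can be illustrated with a minimal sketch. The function names, the regular expressions and the "contains U means RNA" rule are assumptions made for illustration, not the pipeline's actual code; in particular, true mojibake repair is not covered here, only normalization of typographic prime characters.

```python
import re


def clean_sequence(raw: str) -> str:
    """Strip 5'/3' end markers, spaces and hyphens, then uppercase.

    Hypothetical helper: the marker patterns handled here are an assumption.
    """
    # Normalize typographic primes/apostrophes to a plain ASCII apostrophe.
    seq = raw.strip().replace("\u2032", "'").replace("\u2019", "'")
    seq = re.sub(r"^5'\s*-?\s*", "", seq)   # leading 5' marker
    seq = re.sub(r"\s*-?\s*3'$", "", seq)   # trailing 3' marker
    seq = re.sub(r"[\s-]", "", seq)         # internal spaces and hyphens
    return seq.upper()


def canonical_key(seq: str) -> str:
    """Map U to T so DNA and RNA spellings of the same bases share one key."""
    return seq.replace("U", "T")


def infer_chemistry(seq: str) -> str:
    """Naive assumption: any U means RNA. A U-free RNA would be mislabeled."""
    return "RNA" if "U" in seq else "DNA"


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases, computed on the canonical key."""
    key = canonical_key(seq)
    return (key.count("G") + key.count("C")) / len(key) if key else 0.0
```

Keeping the cleaned sequence separate from the canonical key means the original chemistry stays recoverable while the U→T key is used only for matching.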
[Figure: curation funnel]

Of 3,455 ingested records, 3,373 had a usable sequence (82 did not). All 3,373 were categorized into 11 target classes, with none dropped at the categorization step; they collapse to 2,782 unique sequences in the neighbour explorer (see §3).
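
The two-level deduplication (exact canonical key, then containment into families) can be sketched as follows. This is a minimal illustration under stated assumptions: plain substring containment, a hypothetical `min_len` guard against short sequences matching by chance, and a quadratic pairwise scan. The criteria the pipeline actually uses are not specified in this report.

```python
from collections import defaultdict


def collapse_exact(records):
    """Group (canonical_key, source_db) pairs by key, keeping every record."""
    groups = defaultdict(list)
    for key, source in records:
        groups[key].append(source)
    return dict(groups)


def containment_families(keys, min_len=15):
    """Merge sequences into families when a shorter one is a substring of a longer one.

    Uses union-find so that chains of containment end up in one family.
    min_len is a hypothetical guard: very short sequences are never used as
    the contained member, since they would match many sequences by chance.
    """
    keys = sorted(set(keys), key=len)
    parent = {k: k for k in keys}

    def find(k):
        while parent[k] != k:
            parent[k] = parent[parent[k]]  # path halving
            k = parent[k]
        return k

    for i, shorter in enumerate(keys):
        if len(shorter) < min_len:
            continue
        for longer in keys[i + 1:]:
            if len(longer) > len(shorter) and shorter in longer:
                parent[find(shorter)] = find(longer)

    families = defaultdict(list)
    for k in keys:
        families[find(k)].append(k)
    return sorted(sorted(members) for members in families.values())
```

The pairwise scan is O(n²) in the number of unique sequences, which is workable for a few thousand sequences; a larger collection would need an index (for example k-mer based) instead.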

3. The dataset for the neighbour explorer

The interactive neighbour explorer is built from artifacts/ml_dataset_all_targets.parquet, which holds the integrated, de-duplicated records across all 11 target classes. The whole / complex targets are virus, bacterium, toxin and parasite (whole pathogens and toxins) plus cell (whole-cell SELEX). The purified / molecular targets are protein (surface receptors flagged is_cell_surface), peptide, small molecule, nucleic acid and tissue. A residual class, other, holds the rest. Nothing is excluded. Identical sequences from different databases were merged; 351 sequences are shared across ≥2 databases. (A whole/complex-target subset, ml_dataset_complex_targets.parquet, feeds the separate representation study.)

4. Characterization of the curated data

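| Target class | Records | Unique seqs | Families | With Kd | Distinct targets |
|---|---|---|---|---|---|
| virus | 398 | 354 | 302 | 310 | 128 |
| bacterium | 219 | 203 | 168 | 170 | 70 |
| toxin | 148 | 129 | 111 | 131 | 37 |
| parasite | 44 | 36 | 34 | 22 | 14 |
| cell | 266 | 227 | 202 | 233 | 134 |
| protein | 1,857 | 1,581 | 1,307 | 1,520 | 767 |
| peptide | 59 | 56 | 56 | 51 | 27 |
| small molecule | 663 | 505 | 441 | 514 | 255 |
| nucleic acid | 12 | 12 | 12 | 8 | 3 |
| tissue | 17 | 15 | 7 | 11 | 6 |
| other | 74 | 66 | 57 | 44 | 41 |
| Total (records) | 3,373 | 2,782 | 2,329 | 2,710 | 1,319 |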

384 records validated in more than one context carry multiple target classes and are counted once in each, so the per-class Records column sums to 3,757 rather than the 3,373 total.
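
The counting rule above, where a multi-class record is counted once in every class it carries, can be shown with a toy example. The record layout and the `target_classes` field name are hypothetical; the sketch only demonstrates why per-class counts exceed the number of records.

```python
from collections import Counter


def per_class_counts(records):
    """Count each record once in every target class it carries."""
    counts = Counter()
    for rec in records:
        # set() guards against a class being listed twice on one record.
        for cls in set(rec["target_classes"]):
            counts[cls] += 1
    return counts


# Toy data: three records, one of which carries two classes.
records = [
    {"id": "a1", "target_classes": ["virus"]},
    {"id": "a2", "target_classes": ["protein", "cell"]},
    {"id": "a3", "target_classes": ["protein"]},
]
counts = per_class_counts(records)
excess = sum(counts.values()) - len(records)  # one extra count, from record a2
```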

Source contribution per class — all three DBs feed every class:

[Figure: source contribution by class]

Chemistry (2,297 DNA and 1,076 RNA aptamers):

[Figure: chemistry per class]

Sequence length (mean 66 nt, range 5–317), split by chemistry — DNA and RNA shown as separate distributions:

[Figure: length distribution by chemistry (DNA vs RNA)]

Binding affinity (2,710 records with a numeric Kd, spanning pM to mM):

[Figure: Kd distribution]
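
Because the source values span pM to mM, the normalization to nM in step 3 of the pipeline amounts to a unit-multiplier lookup plus flags for approximate and range values. The sketch below is one way this might look. The accepted input spellings, the `kind` parameter used to skip Ka values, and the choice of the geometric mean for ranges are all assumptions, not the pipeline's documented behaviour.

```python
import math
import re

# Multipliers from each unit to nM. Both micro code points and "uM" are accepted.
_TO_NM = {"pM": 1e-3, "nM": 1.0, "uM": 1e3, "\u00b5M": 1e3, "\u03bcM": 1e3, "mM": 1e6}
_UNIT = "|".join(re.escape(u) for u in _TO_NM)
_KD = re.compile(
    rf"^\s*(?P<approx>[~<>]?)\s*(?P<lo>\d+(?:\.\d+)?)\s*"
    rf"(?:-\s*(?P<hi>\d+(?:\.\d+)?)\s*)?(?P<unit>{_UNIT})\s*$"
)


def parse_kd(text, kind="Kd"):
    """Return (kd_in_nM, flag) or None when the value is skipped or unparseable.

    flag is "exact", "approx" (leading ~, < or >) or "range" (e.g. "1-4 nM").
    Assumption: a range is summarized by its geometric mean.
    """
    if kind != "Kd":  # association constants (Ka) are skipped, not converted
        return None
    match = _KD.match(text)
    if match is None:
        return None
    factor = _TO_NM[match["unit"]]
    lo = float(match["lo"])
    if match["hi"] is not None:
        return math.sqrt(lo * float(match["hi"])) * factor, "range"
    return lo * factor, "approx" if match["approx"] else "exact"
```

For example, `parse_kd("2.5 \u00b5M")` yields a value of 2500 nM flagged "exact", while `parse_kd("10 nM", kind="Ka")` returns `None`.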

5. Caveats for downstream modeling

Generated by aptamer_benchmark/make_report.py from the pipeline artifacts. Sources: UTexas Aptamer Database (NAR 2024), AptaDB (RNA 2024), APTABASE (IIT Guwahati). UniProt for subcellular-location annotation.