Integration of three curated aptamer databases into one clean, fully-categorized
dataset of aptamers across all target categories — the dataset behind the interactive
neighbour explorer. Generated from aptamer_benchmark/.
Three public, manually curated aptamer collections were combined. They differ in size, accessibility and licensing (record counts show how many records each contributes to this dataset; they sum to 3,373):
| Database | Records | Bulk access | License |
|---|---|---|---|
| UTexas Aptamer DB (Ellington lab) | 1,495 | Zenodo xlsx | CC BY 3.0 US |
| AptaDB (ECUST) | 1,350 | scraped (no API) | academic/unclear |
| APTABASE (IIT Guwahati) | 528 | scraped | academic/unclear |
UTexas is the most permissively licensed and richest (pre-normalized Kd, DNA/RNA
type, buffer/pH); AptaDB carries UniProt accessions enabling principled target typing;
APTABASE needed encoding clean-up. Provenance and a redistributable flag
are kept per row.
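As a rough illustration of what per-row provenance could look like, the sketch below attaches a source database, a license string and a redistributable flag to each record. The field names (`source_db`, `license`, `redistributable`), the `tag_provenance` helper, the placeholder sequence, and the rule that only the CC BY source counts as redistributable are all assumptions made for this example, not the pipeline's actual schema.

```python
# License strings copied from the database table above; keys are shorthand labels.
LICENSES = {
    "UTexas": "CC BY 3.0 US",
    "AptaDB": "academic/unclear",
    "APTABASE": "academic/unclear",
}


def tag_provenance(record, source_db):
    """Return a copy of `record` with hypothetical provenance fields attached."""
    license_name = LICENSES[source_db]
    tagged = dict(record)  # copy so the caller's record is left untouched
    tagged["source_db"] = source_db
    tagged["license"] = license_name
    # Assumption: only an explicit CC BY license makes a row redistributable.
    tagged["redistributable"] = license_name.startswith("CC BY")
    return tagged


row = tag_provenance({"sequence": "ACGTACGT"}, "UTexas")  # placeholder sequence
print(row["source_db"], row["license"], row["redistributable"])
```

Keeping the flag on every row, rather than per database, means a merged record can still be filtered for redistribution after sources are combined.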
Of 3,455 ingested records, 3,373 had a usable sequence. All 3,373 were categorized into 11 target classes (none were dropped at this stage); they collapse to 2,782 unique sequences in the neighbour explorer (see §3).
The interactive neighbour explorer is built from
artifacts/ml_dataset_all_targets.parquet: the integrated, de-duplicated records
across all 11 target classes. Whole / complex targets are virus, bacterium, toxin and
parasite (whole pathogens and toxins) and cell (whole-cell SELEX). Purified /
molecular targets are protein (surface receptors flagged is_cell_surface),
peptide, small molecule, nucleic acid and tissue. A residual other class covers
the rest. Nothing is excluded. Identical sequences from different databases were
merged; 351 sequences are shared across ≥2 databases. (A whole/complex-target
subset, ml_dataset_complex_targets.parquet, feeds the separate representation
study.)
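The merge of identical sequences described above can be sketched as grouping records by a normalized sequence string and collecting the databases that report it. The normalization used here (uppercase, whitespace removed), the `source_db` field and the toy records are assumptions for illustration; the pipeline's real matching rule may differ.

```python
from collections import defaultdict


def normalize(seq):
    """Assumed canonical form: strip all whitespace and uppercase."""
    return "".join(seq.split()).upper()


def merge_identical(records):
    """Map each normalized sequence to the set of databases that contain it."""
    merged = defaultdict(set)
    for rec in records:
        merged[normalize(rec["sequence"])].add(rec["source_db"])
    return dict(merged)


# Toy records, not taken from the dataset.
records = [
    {"sequence": "ACGTACGT", "source_db": "UTexas"},
    {"sequence": "acgt acgt", "source_db": "AptaDB"},
    {"sequence": "GGCCAAUU", "source_db": "APTABASE"},
]

merged = merge_identical(records)
shared = [seq for seq, dbs in merged.items() if len(dbs) >= 2]
print(len(merged), "unique sequences;", len(shared), "shared across >=2 databases")
```

Storing the set of source databases per sequence keeps provenance intact after the merge and makes the "shared across ≥2 databases" count a one-line filter.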
| Target class | Records | Unique seqs | Families | With Kd | Distinct targets |
|---|---|---|---|---|---|
| virus | 398 | 354 | 302 | 310 | 128 |
| bacterium | 219 | 203 | 168 | 170 | 70 |
| toxin | 148 | 129 | 111 | 131 | 37 |
| parasite | 44 | 36 | 34 | 22 | 14 |
| cell | 266 | 227 | 202 | 233 | 134 |
| protein | 1857 | 1581 | 1307 | 1520 | 767 |
| peptide | 59 | 56 | 56 | 51 | 27 |
| small molecule | 663 | 505 | 441 | 514 | 255 |
| nucleic acid | 12 | 12 | 12 | 8 | 3 |
| tissue | 17 | 15 | 7 | 1 | 16 |
| other | 74 | 66 | 57 | 44 | 41 |
| Total (records) | 3373 | 2782 | 2329 | 2710 | 1319 |
384 records validated in more than one context carry multiple target classes and are counted once in each, so per-class rows sum to more than the totals (3,757 versus 3,373 for the Records column).
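The double counting can be checked directly from the table. The snippet below sums the per-class Records column and compares it with the total; the numbers are copied from the table above, while the variable names are only illustrative. The surplus equals the 384 multi-class records, which is what one would expect if each of them carries exactly two classes; that reading is an inference from the figures, not something the report states.

```python
# Records column of the per-class table above.
per_class_records = {
    "virus": 398,
    "bacterium": 219,
    "toxin": 148,
    "parasite": 44,
    "cell": 266,
    "protein": 1857,
    "peptide": 59,
    "small molecule": 663,
    "nucleic acid": 12,
    "tissue": 17,
    "other": 74,
}
total_records = 3373  # "Total (records)" row

class_sum = sum(per_class_records.values())
surplus = class_sum - total_records  # extra class memberships from multi-class records

print("sum over classes:", class_sum)
print("surplus over total:", surplus)
```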
Source contribution per class — all three DBs feed every class:
Chemistry — 2297 DNA and 1076 RNA aptamers:
Sequence length (mean 66 nt, range 5–317), split by chemistry — DNA and RNA shown as separate distributions:
Binding affinity — 2710 records with a numeric Kd, spanning pM to mM:
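Because the Kd values span roughly nine orders of magnitude (pM to mM), they are easier to compare on a log scale. A minimal sketch, assuming Kd is stored as a value plus a unit string: convert to molar and take pKd = -log10(Kd). The unit table, the `pkd` helper and the example values are illustrative and not part of the pipeline.

```python
import math

# Conversion factors to molar for the units mentioned in the text.
UNIT_TO_MOLAR = {"pM": 1e-12, "nM": 1e-9, "uM": 1e-6, "mM": 1e-3, "M": 1.0}


def pkd(value, unit):
    """Return pKd = -log10(Kd in molar); larger means tighter binding."""
    if value <= 0:
        raise ValueError("Kd must be positive")
    molar = value * UNIT_TO_MOLAR[unit]
    return -math.log10(molar)


# Illustrative values only, not records from the dataset.
for value, unit in [(1, "pM"), (25, "nM"), (1, "mM")]:
    print(f"{value} {unit} -> pKd {pkd(value, unit):.2f}")
```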
Split by seq_cluster_id (family-level CV), never randomly; otherwise evaluation leaks.

See target_map.csv for edge cases.

Generated by aptamer_benchmark/make_report.py from the pipeline
artifacts. Sources: UTexas Aptamer Database (NAR 2024), AptaDB (RNA 2024), APTABASE (IIT
Guwahati). UniProt for subcellular-location annotation.
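The family-level cross-validation mentioned above (splitting by seq_cluster_id rather than randomly) can be sketched with the standard library alone. The `family_folds` helper, the greedy balancing rule and the toy cluster ids are assumptions for illustration; the only thing taken from the report is the seq_cluster_id column name.

```python
from collections import defaultdict


def family_folds(records, n_folds=5):
    """Assign whole sequence families to folds so no family spans two folds."""
    families = defaultdict(list)
    for idx, rec in enumerate(records):
        families[rec["seq_cluster_id"]].append(idx)

    folds = [[] for _ in range(n_folds)]
    # Greedy balancing: place the largest families first, each into the
    # fold that currently holds the fewest records.
    for members in sorted(families.values(), key=len, reverse=True):
        min(folds, key=len).extend(members)
    return folds


# Toy records with made-up cluster ids.
records = [{"seq_cluster_id": cid} for cid in [0, 0, 0, 1, 1, 2, 3, 3, 4, 5]]
folds = family_folds(records, n_folds=3)

for i, fold in enumerate(folds):
    clusters = sorted({records[j]["seq_cluster_id"] for j in fold})
    print(f"fold {i}: {len(fold)} records, clusters {clusters}")
```

Because every member of a family lands in the same fold, near-identical sequences never appear on both sides of a train/test boundary. scikit-learn's `GroupKFold` provides the same guarantee if that library is available.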