Curating public aptamer databases for downstream analysis

Integration of three curated aptamer databases into one clean, fully categorized dataset of aptamers across all target categories. This is the dataset behind the interactive neighbour explorer. Generated from aptamer_benchmark/.

3 source DBs
3,373 curated records
2,782 unique sequences
2,329 sequence families
2,710 with Kd

1. Source databases

Three public, manually curated aptamer collections were combined. They differ in size, accessibility and licensing. The record counts below are each database's contribution to this dataset and sum to 3,373:

Database	Records	Bulk access	License
UTexas Aptamer DB (Ellington lab)	1,495	Zenodo xlsx	CC BY 3.0 US
AptaDB (ECUST)	1,350	scraped (no API)	academic/unclear
APTABASE (IIT Guwahati)	528	scraped	academic/unclear

UTexas is the most permissively licensed and richest (pre-normalized Kd, DNA/RNA type, buffer/pH); AptaDB carries UniProt accessions enabling principled target typing; APTABASE needed encoding clean-up. Provenance and a redistributable flag are kept per row.

2. Cleaning & integration pipeline

Ingest all three into one long schema (one row per aptamer-affinity record) with provenance.
Clean sequences: fix mojibake, strip 5'/3' markers, uppercase, U→T canonical key; derive DNA/RNA and a chemical-modification flag; recompute length & GC.
Normalize affinity to a numeric Kd in nM (pM/µM/mM converted); flag approx/range; skip association constants (Ka).
Categorize targets by rule, in priority order: pathogens (virus/bacterium/toxin/parasite, host-gated) → whole cells → cell-surface proteins (AptaDB via UniProt subcellular location; others via surface markers) → the remaining classes listed in §3, with a residual other instead of dropping records.
Deduplicate: exact canonical key collapses identical aptamers across DBs (keeping all affinity records); containment clustering merges primer-trimmed / truncation variants into sequence families.
Export the curated dataset + datasheet.
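The cleaning and affinity-normalization steps above can be sketched as follows. This is a minimal illustration, not the code in aptamer_benchmark/: the function names, the accepted unit spellings, the Ka check and the choice of the arithmetic midpoint for ranges are all assumptions, and mojibake repair and modification flags are not covered.

```python
import re

# Factor from each unit prefix to nM (accepted spellings are an assumption).
_PREFIX_TO_NM = {"p": 1e-3, "n": 1.0, "u": 1e3, "µ": 1e3, "μ": 1e3, "m": 1e6}

_KD_PATTERN = re.compile(
    r"([~≈])?\s*(\d+(?:\.\d+)?)"       # optional approx sign, first number
    r"\s*(?:[-–]\s*(\d+(?:\.\d+)?))?"  # optional upper end of a range
    r"\s*([pnuµμm])M"                  # unit: pM, nM, uM/µM, mM
)


def canonical_key(raw):
    """Strip 5'/3' markers, uppercase, keep A/C/G/T/U/N only, map U to T."""
    s = re.sub(r"[53]\s*['′’]", "", raw)       # drop 5' / 3' end markers
    s = re.sub(r"[^ACGTUN]", "", s.upper())    # drop dashes, spaces, other symbols
    return s.replace("U", "T")


def gc_fraction(key):
    """GC content of a canonical key, as a fraction between 0 and 1."""
    return (key.count("G") + key.count("C")) / len(key) if key else 0.0


def kd_to_nm(text):
    """Return (kd_nm, flag) with flag in {'exact', 'approx', 'range'}, or None.

    Association constants (Ka) and strings without a recognised unit are skipped.
    """
    if re.search(r"\bKa\b", text):
        return None
    m = _KD_PATTERN.search(text)
    if m is None:
        return None
    approx, low, high, prefix = m.groups()
    factor = _PREFIX_TO_NM[prefix]
    if high is not None:
        # Assumption: a range is represented by its midpoint and flagged.
        return (float(low) + float(high)) / 2 * factor, "range"
    return float(low) * factor, "approx" if approx else "exact"
```

For example, `canonical_key("5'-GGGAGAcuc-3'")` gives `"GGGAGACTC"`, so a DNA and an RNA spelling of the same sequence share one key, and `kd_to_nm("10-20 nM")` gives `(15.0, "range")`.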

Of 3,455 ingested records, 3,373 had a usable sequence; the other 82 did not and are not part of the curated set. All 3,373 were categorized into 11 target classes (nothing dropped at that step), and they collapse to 2,782 unique sequences in the neighbour explorer (see §3).
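The containment clustering used in the deduplication step can be illustrated with a small union-find sketch: two sequences join the same family when one contains the other, which is the situation primer trimming and truncation produce. The function name and the `min_len` guard (which stops very short sequences from matching almost everything) are assumptions, not the pipeline's actual parameters.

```python
def containment_families(seqs, min_len=15):
    """Map each unique sequence to a family id.

    Two sequences share a family when the shorter one (at least `min_len` nt)
    is a substring of the longer one; merges are transitive.
    """
    uniq = sorted(set(seqs), key=lambda s: (-len(s), s))  # longest first
    parent = {s: s for s in uniq}

    def find(s):
        # Follow parent links to the family root, compressing the path.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for i, longer in enumerate(uniq):
        for shorter in uniq[i + 1:]:
            if len(shorter) >= min_len and shorter in longer:
                parent[find(shorter)] = find(longer)

    family_ids = {}
    # Number families in order of first appearance (longest sequence first).
    return {s: family_ids.setdefault(find(s), len(family_ids)) for s in uniq}
```

The pairwise loop is quadratic in the number of unique sequences, which stays manageable at the scale of a few thousand sequences reported here. Because merges are transitive, one long sequence can link two fragments that do not overlap each other.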

3. The dataset for the neighbour explorer

The interactive neighbour explorer is built from artifacts/ml_dataset_all_targets.parquet, which holds the integrated, de-duplicated records across all 11 target classes. The whole / complex targets are virus, bacterium, toxin, parasite (whole pathogens & toxins) and cell (whole-cell SELEX). The purified / molecular targets are protein (surface receptors flagged is_cell_surface), peptide, small molecule, nucleic acid and tissue. A residual other class holds everything else, so nothing is excluded. Identical sequences from different databases were merged; 351 sequences are shared across ≥2 databases. (A whole/complex-target subset, ml_dataset_complex_targets.parquet, feeds the separate representation study.)
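A count such as "sequences shared across ≥2 databases" can be reproduced by grouping provenance by canonical sequence key. The sketch below works on plain `(canonical_key, source_db)` pairs; that record shape and the function name are assumptions for illustration, since the parquet column names are not listed in this report.

```python
from collections import defaultdict


def shared_across_sources(records, min_sources=2):
    """Return {canonical_key: set of source DBs} for keys seen in >= min_sources DBs.

    `records` is an iterable of (canonical_key, source_db) pairs (hypothetical shape).
    """
    sources = defaultdict(set)
    for key, db in records:
        sources[key].add(db)
    return {key: dbs for key, dbs in sources.items() if len(dbs) >= min_sources}


# Toy usage with made-up sequences and the three database names.
toy = [
    ("ACGTACGT", "UTexas"),
    ("ACGTACGT", "AptaDB"),
    ("ACGTACGT", "AptaDB"),   # a second affinity record from the same DB
    ("GGGTTTCC", "APTABASE"),
]
shared = shared_across_sources(toy)
print(len(shared))  # 1
```

Repeated records from one database do not count as sharing, because sources are collected into a set per sequence.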

4. Characterization of the curated data

Target class	Records	Unique seqs	Families	With Kd	Distinct targets
virus	398	354	302	310	128
bacterium	219	203	168	170	70
toxin	148	129	111	131	37
parasite	44	36	34	22	14
cell	266	227	202	233	134
protein	1857	1581	1307	1520	767
peptide	59	56	56	51	27
small molecule	663	505	441	514	255
nucleic acid	12	12	12	8	3
tissue	17	15	7	1	16
other	74	66	57	44	41
Total (records)	3373	2782	2329	2710	1319

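384 records validated in more than one context carry multiple target classes and are counted once in each class. As a result the per-class Records column sums to 3,757, which is 384 more than the 3,373 total records. The other columns of the Total row are likewise overall distinct counts, not column sums.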
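The multi-class counting rule can be checked with a few lines of arithmetic. The sketch below first shows the rule on toy data (a record counts once per class it carries), then sums the Records column of the table above. The function name is hypothetical; the numbers are copied from the table.

```python
from collections import Counter


def per_class_counts(record_classes):
    """Count records per class; a record counts once in each class it carries."""
    counts = Counter()
    for classes in record_classes:
        counts.update(set(classes))
    return counts


# Toy data: three records, one of which carries two classes.
toy = [{"virus"}, {"protein"}, {"virus", "protein"}]
toy_counts = per_class_counts(toy)
extra_labels = sum(len(c) - 1 for c in toy)
# The per-class sum exceeds the record count by the number of extra labels.
assert sum(toy_counts.values()) == len(toy) + extra_labels

# Records column from the characterization table.
table_records = {
    "virus": 398, "bacterium": 219, "toxin": 148, "parasite": 44,
    "cell": 266, "protein": 1857, "peptide": 59, "small molecule": 663,
    "nucleic acid": 12, "tissue": 17, "other": 74,
}
class_sum = sum(table_records.values())
print(class_sum, class_sum - 3373)  # 3757 384
```

The excess of 384 equals the number of multi-class records, which is what one would expect if each of them carries exactly two classes; the report does not state that breakdown explicitly.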

Source contribution per class (figure): all three DBs feed every class.

Chemistry (figure): 2,297 DNA and 1,076 RNA aptamer records.

Sequence length by chemistry (figure): mean 66 nt, range 5–317 nt, with DNA and RNA shown as separate distributions.

Binding affinity (figure): 2,710 records with a numeric Kd, spanning pM to mM.

5. Caveats for downstream modeling

Redundancy: 2,782 sequences fall into 2,329 families. Always split by seq_cluster_id (family-level CV), never randomly; otherwise near-identical variants land on both sides of the split and the evaluation leaks.
Positives only: these are published binders with essentially no true negatives, so any classifier needs negative sampling (shuffled / cross-target decoys).
Heterogeneous Kd: values come from different methods and buffers, so compare within a method and prefer ranking or strong-vs-weak classification over absolute regression.
Rule-based categorization: AptaDB protein typing uses UniProt, while UTexas/APTABASE free text is typed by keywords (less precise). Audit target_map.csv for edge cases.
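The first two caveats can be made concrete with a short sketch: a family-level fold assignment keyed on seq_cluster_id, and a composition-preserving shuffled decoy. Both helpers are hypothetical illustrations under simple assumptions (greedy balancing of fold sizes, mononucleotide shuffling), not the method used by the pipeline.

```python
import random
from collections import defaultdict


def family_folds(cluster_ids, n_folds=5):
    """Return one fold index per row so that every family stays in a single fold.

    Families are assigned largest first to the currently lightest fold.
    """
    sizes = defaultdict(int)
    for cid in cluster_ids:
        sizes[cid] += 1
    load = [0] * n_folds
    fold_of = {}
    for cid in sorted(sizes, key=lambda c: (-sizes[c], str(c))):
        k = min(range(n_folds), key=load.__getitem__)  # lightest fold so far
        fold_of[cid] = k
        load[k] += sizes[cid]
    return [fold_of[cid] for cid in cluster_ids]


def shuffled_decoy(seq, rng):
    """Shuffle the nucleotides of `seq`, preserving length and base composition.

    A decoy is only presumed to be a non-binder; it is not a measured negative.
    """
    chars = list(seq)
    rng.shuffle(chars)
    return "".join(chars)


# Toy usage: seven rows from four families, split into two folds.
ids = [1, 1, 2, 3, 3, 3, 4]
folds = family_folds(ids, n_folds=2)
decoy = shuffled_decoy("GGGAGACTCTTACG", random.Random(0))
```

When scikit-learn is available, `sklearn.model_selection.GroupKFold` with `groups` set to seq_cluster_id serves the same purpose as `family_folds`.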

Generated by aptamer_benchmark/make_report.py from the pipeline artifacts. Sources: UTexas Aptamer Database (NAR 2024), AptaDB (RNA 2024), APTABASE (IIT Guwahati). UniProt for subcellular-location annotation.