Output is target_classes (a list). The flow mirrors the code in
aptamer_benchmark/04_targets.py exactly (classify() = steps 2–6; main() = step 1 overrides +
the protein+cell co-target). Protein is decided before cell: a target that is a protein
(including a surface marker named on a cell) is classed as protein; only non-proteins reach the cell test.
Hand-curated / double-checked decisions are on disk and traceable:
artifacts/review_aptabase_dna_rna_20.csv, artifacts/review_protein_cell_cotargets.csv.
text = target_raw + organism + general, lowercased (the words intracellular /
extracellular / antiviral / antibacterial blanked). gen = general (source entity-type
field). “in SET” = a keyword from SET is a substring of text.
is_cell_surface is a flag on protein records, not a class: set true when the protein
match is via the SURFACE set (and text is not in SECRETED), or — for AptaDB records
with a UniProt accession — when UniProt’s subcellular location is membrane/cell-surface.
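The flag rule above can be sketched as follows. The keyword tuples are short excerpts of the SURFACE and SECRETED sets, and the function name, the uniprot_location parameter and the MEMBRANE_HINTS strings are hypothetical stand-ins for however the real code reads UniProt's subcellular location.

```python
# Excerpts only; the full SURFACE and SECRETED sets are listed further down.
SURFACE = ("receptor", "cd4", "cd28", "egfr", "her2", "psma", "cell surface")
SECRETED = ("secreted", "vegf", "tnf", "pdgf", "fgf", "interleukin", "il-")
# Assumed wording of a membrane / cell-surface location annotation.
MEMBRANE_HINTS = ("membrane", "cell surface")


def is_cell_surface(text, uniprot_location=None):
    """Flag for protein records; uniprot_location is passed only for
    AptaDB records that carry a UniProt accession (assumption)."""
    by_keyword = (any(k in text for k in SURFACE)
                  and not any(k in text for k in SECRETED))
    by_uniprot = bool(uniprot_location) and any(
        hint in uniprot_location.lower() for hint in MEMBRANE_HINTS)
    return by_keyword or by_uniprot
```

Under a literal substring reading, text such as "vegfr" also contains the SECRETED keyword "vegf"; how the real code treats that overlap is not stated here.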
is_protein = gen is protein/antibody; OR target_raw matches ^[A-Z0-9]{2,6}_[A-Z0-9]{2,6}$
(a UniProt entry name, e.g. EGFR_HUMAN); OR ext_id is a UniProt accession.
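A sketch of the three is_protein signals. The entry-name pattern is the one given above; the accession pattern is UniProt's published accession format and is an assumption about how ext_id is recognised in 04_targets.py.

```python
import re

# Entry-name pattern quoted in the text above.
ENTRY_NAME = re.compile(r"^[A-Z0-9]{2,6}_[A-Z0-9]{2,6}$")
# Assumed: UniProt's documented accession format (6 or 10 characters).
ACCESSION = re.compile(
    r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$"
)


def is_protein(gen, target_raw, ext_id):
    """True when any one of the three signals fires."""
    if (gen or "").strip().lower() in ("protein", "antibody"):
        return True
    if ENTRY_NAME.match(target_raw or ""):
        return True
    return bool(ACCESSION.match(ext_id or ""))
```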
Protein + cell co-target (the protein → yes branch above; per-record, double-checked,
traceable to review_protein_cell_judgment.csv and review_protein_cell_cotargets.csv).
A protein whose target also names a specific cell the aptamer was raised against is decided
by the aim of the source study: if the aim was the cell → [protein, cell]; if the aim
was the protein, then by whether that protein is naturally highly expressed on that cell —
yes (the cell genuinely displays it) → [protein, cell]; no (engineered / transfected /
expression host) → protein.
- [protein, cell]: PSMA on prostate-cancer / LNCaP cells, ErbB2 in breast cancer cells,
HER2/neu via TUBO, CD33 on leukemia cells, CD28 / CD4 / 4-1BB on T cells, CD16a+c-Met
(bispecific), AGTR1 on breast cancer cells.
- kept protein (no natural high expression): LDL-R cell-expressed, AMPA in HEK-293,
MMP14-transfected 293T, OX40 (“on the surface of” — location wording).
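The co-target rule reduces to two per-record judgments. The sketch below only encodes the rule; in the pipeline both inputs come from the hand-curated review files, and the function and parameter names are hypothetical.

```python
def cotarget_classes(study_aim, naturally_expressed):
    """Apply the protein + cell co-target rule to one curated record.

    study_aim: "cell" or "protein" (aim of the source study).
    naturally_expressed: whether the protein is naturally highly
    expressed on the named cell; only consulted when the aim is the protein.
    """
    if study_aim == "cell":
        return ["protein", "cell"]
    if study_aim == "protein":
        # Engineered / transfected / expression-host cells stay protein-only.
        return ["protein", "cell"] if naturally_expressed else ["protein"]
    raise ValueError(f"unknown study aim: {study_aim!r}")
```

For example, PSMA on LNCaP cells (protein aim, naturally expressed) yields [protein, cell], while MMP14-transfected 293T (protein aim, not natural) yields protein only.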
Pathogen + protein co-label (the dotted edge from the pathogen group to the protein check, step 4).
A pathogen target that is also a protein carries both labels, e.g. [virus, protein],
[toxin, protein] — the protein test (step 4 signals: is_protein/proteinish) and the
pathogen check both apply. In code the pathogen check runs first (its keyword sets are more
reliable than the gen entity field; APTABASE, for example, mislabels aflatoxin as gen=Protein), and
protein is then added; the pathogen class stays primary, so the pathogen classes and the
complex-targets set are unchanged.
- [pathogen, protein] (331 records): viral glycoproteins (SARS-CoV-2 spike, hemagglutinin,
HIV env/pol), protein toxins (ricin, staph enterotoxin, C. diff toxin), bacterial/parasite
protein antigens, and UniProt pathogen-protein entries (SPIKE_SARS2, RICI_RICCO, …).
- stays single pathogen: non-protein whole pathogens (whole virus, whole bacteria, LPS) and
small-molecule toxins (aflatoxin, microcystin, ochratoxin).
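A sketch of the co-label order: pathogen first, protein appended second. The keyword tuples are excerpts, the order in which the pathogen sets are tried and the "bacterium" / "parasite" label spellings are assumptions, and SMALL_MOLECULE_TOXINS is a hypothetical guard; how the real code keeps small-molecule toxins single-label is not stated above.

```python
# Excerpts of the keyword sets listed below; order of the groups is assumed.
PATHOGEN_SETS = (
    ("virus", ("virus", "sars", "spike protein", "hemagglutinin")),
    ("toxin", ("toxin", "ricin", "microcystin")),
    ("bacterium", ("bacteri", "e. coli", "lps")),
    ("parasite", ("plasmodium", "malaria")),
)
# Hypothetical guard for toxins that are small molecules, not proteins.
SMALL_MOLECULE_TOXINS = ("aflatoxin", "microcystin", "ochratoxin")


def pathogen_labels(text, protein_signal):
    """Return [pathogen] or [pathogen, protein]; [] when no pathogen set matches."""
    for label, keywords in PATHOGEN_SETS:
        if any(kw in text for kw in keywords):
            classes = [label]  # the pathogen class stays primary
            small = any(t in text for t in SMALL_MOLECULE_TOXINS)
            if protein_signal and not small:
                classes.append("protein")
            return classes
    return []
```

In this sketch an aflatoxin record mislabelled gen=Protein still comes out as a single toxin label, because the pathogen keywords are checked before the protein signal is considered.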
VIRUS = virus, viral, hiv, influenza, hemagglutinin, sars, cov-2, cov2, coronavirus, hepatitis, hbv, hcv, hbsag, dengue, zika, ebola, norovirus, rotavirus, rabies, papillomavirus, hpv, gp120, gp41, nucleocapsid, spike protein, spike glycoprotein, envelope protein, rsv, h1n1, h5n1, h3n2, h7n9, vaccinia, herpes, cytomegalovirus, cmv, measles, mumps, west nile, japanese encephalitis, yellow fever, marburg, lassa, chikungunya, enterovirus, adenovirus, parvovirus, tat protein, rev protein, ns1 protein, ns3, ns5, vp40, vp35, core antigen, iridovirus
viral mnemonic = target_raw ends _…(HIV|INFA|INFB|CORO|SARS|HCV|HBV|HPV|DENV|ZIKV|EBOV|RABV|MEASV|ADE|HHV|VIRU)…
HOST (a host-organism species suffix on a UniProt entry name; a record is host when target_raw ends in one) = _HUMAN, _MOUSE, _RAT, _BOVIN (cattle), _PIG, _RABIT (rabbit), _SHEEP, _CHICK (chicken), _CANLF (dog), _HORSE, _FELCA (cat), _MACMU (rhesus macaque)
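The HOST test is a plain suffix check on target_raw. A minimal sketch, with a hypothetical function name and an assumed strip() of surrounding whitespace:

```python
# Host-organism suffixes from the HOST set above.
HOST_SUFFIXES = (
    "_HUMAN", "_MOUSE", "_RAT", "_BOVIN", "_PIG", "_RABIT",
    "_SHEEP", "_CHICK", "_CANLF", "_HORSE", "_FELCA", "_MACMU",
)


def is_host(target_raw):
    """True when target_raw ends in one of the host species suffixes."""
    # str.endswith accepts a tuple and checks each suffix in turn.
    return target_raw.strip().endswith(HOST_SUFFIXES)
```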
TOXIN = toxin, aflatoxin, ochratoxin, ricin, enterotoxin, botulinum, tetanus, microcystin, hemolysin, shiga, verotoxin
BACTERIUM = bacteri, escherichia, e. coli, e.coli, salmonella, staphylococ, streptococ, mycobacter, tuberculosis, listeria, campylobacter, pseudomonas, vibrio, cholerae, clostridi, bacillus, lps, lipopolysaccharide, lipoteichoic, shigella, klebsiella, anthracis, acinetobacter, enterococc, helicobacter, neisseria, proteus, serratia, yersinia, francisella, brucella, burkholderia, legionella, bordetella, haemophilus, moraxella, mycoplasma, chlamydia, rickettsia, treponema, borrelia, leptospira
PARASITE = plasmodium, malaria, leishmania, trypanosoma, toxoplasma, schistosoma, giardia, entamoeba, babesia, cryptosporidium
CELL_LINE = cell line, cell-line, “cells” (whole word), ccrf-cem, mcf7, mcf-7, ramos, a549, hela, hek293, hek-293, jurkat, k562, pc3, pc-3, du145, hepg2, sk-br-3 (and skbr3), hct116, hct-116, ht29, ht-29, mda-mb, u87, u-87, u118, caco2, caco-2, 4t1, b16, kato, hl60, hl-60, nb4, raw264, raw 264, macrophage, monocyte, fibroblast, hepatocyte, erythrocyte, platelet, carcinoma, leukemia, leukaemia, lymphoma, “cancer” (whole word), tumor, tumour, glioma, glioblastoma, melanoma, adenocarcinoma
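CELL_LINE mixes substring keywords with two whole-word entries ("cells", "cancer"). One way to honour that distinction is a word-boundary regex for the whole-word entries; the sketch below uses a short excerpt of the set, and the function name is hypothetical.

```python
import re

# Excerpt of the substring keywords in CELL_LINE.
SUBSTRING_KW = ("cell line", "hela", "mcf-7", "leukemia")
# Entries marked "(whole word)" in CELL_LINE.
WHOLE_WORD_KW = ("cells", "cancer")


def in_cell_line(text):
    """Substring match for most keywords, word-boundary match for the rest."""
    if any(kw in text for kw in SUBSTRING_KW):
        return True
    return any(
        re.search(rf"\b{re.escape(word)}\b", text) is not None
        for word in WHOLE_WORD_KW
    )
```

The whole-word rule keeps text such as "anticancer drug" from matching on "cancer" while "breast cancer cells" still matches.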
PROTEIN_INDICATOR = antigen, receptor, protein, kinase, glycoprotein, enzyme, “factor” (whole word), globulin, albumin, mucin, integrin, selectin, cadherin, antibody, interleukin, cytokine
SURFACE = receptor, cd4, cd28, cd30, cd44, cd63, cd133, cd antigen, egfr, her2, erbb2, epcam, psma, ptk7, muc1, mucin, transferrin receptor, tfr, integrin, pdgfr, vegfr, nucleolin, tenascin, pd-l1, pd-1, pdl1, ctla, neuropilin, nrp1, l-selectin, e-selectin, icam, vcam, membrane protein, cell surface, cell-surface, glypican, mesothelin, cadherin, her-2, carcinoembryonic, gpc3, epithelial cell adhesion
SECRETED = secreted, vegf, tnf, pdgf, fgf, interleukin, il-
ENZYME = polymerase, phosphatase, dehydrogenase, kinase, elastase, protease, ribonuclease, deoxyribonuclease, nuclease, ligase, reductase, oxidase, transferase, hydrolase, esterase, lipase, amylase, peroxidase, isomerase, synthetase, synthase, lactamase, gyrase, topoisomerase, helicase, transcriptase, caspase, convertase, secretase, neuraminidase, sialidase, catalase, luciferase, galactosidase, glucosidase, glucuronidase, invertase, aldolase, enolase, lysozyme
NAMED_PROTEIN = thrombin, streptavidin, avidin, lactoferrin, leptin, fibrinogen, fibronectin, immunoglobulin, ig[e/a/m/g], prion, prp/prpc/prps, gfp, green fluorescent, mammaglobin, colicin, integration host factor, ihf, abf1, b52, unr, gelsolin, calmodulin, ferritin, transferrin, c-reactive, crp, alpha-fetoprotein, afp, prostate specific, psa, keratinocyte growth, urokinase, plau, plasmin, allergen, ara h, zinc finger, wt1, ptp1b, streptokinase, cytochrome, hemoglobin, myoglobin, serum albumin, bsa, hsa, ovalbumin, casein, connective tissue growth,
NUCLEIC_KW = trna, rrna, mrna, mirna, sirna, sgrna, snrna, ribosomal rna, 16s, 23s, “5s “, oligonucleotide, ribozyme, riboswitch, let-7, g-quadruplex, i-motif dna, duplex dna, hairpin dna, cdna, plasmid, telomer, stem-loop, stem loop, rna stem, dna target
PEPTIDE_KW = peptide, anepiii, amyloid, substance p, bradykinin, angiotensin, gnrh, oxytocin, vasopressin, melittin, gliadin, glucagon, natriuretic, casomorphin, histatin, gonadoliberin, calcitonin gene-related
SMALL_MOL_KW = atp, gtp, ctp, utp, ” adp”, ” amp”, camp, cgmp, cyclic di, adenosine, guanosine, cytidine, uridine, inosine, xanthine, theophylline, caffeine, s-adenosyl, sialyllactose, lactose, glucose, sucrose, fructose, maltose, neu5gc, n-glycolylneuraminic, sialic acid, n-acetyl, glucosamine, chitin, sphingosylphosphorylcholine, sphingosine, cholesterol, phosphatidyl, cholic acid, tetracyclin, oxytetracyclin, doxycyclin, doxycyline, tobramycin, streptomycin, kanamycin, neomycin, gentamicin, ampicillin, penicillin, chloramphenicol, ciprofloxacin, fluoroquinolone, norfloxacin, ofloxacin, enrofloxacin, sulfadimethoxine, sulfamethazine, sulfamethoxazole, vancomycin, daunomycin, daunorubicin, doxorubicin, diclofenac, ibuprofen, acetaminophen, paracetamol, cocaine, methamphetamine, amphetamine, morphine, codeine, ketamine, estradiol, estrogen, estriol, estrone, testosterone, progesterone, cortisol, corticosterone, dexamethasone, dopamine, serotonin, histamine, melatonin, thyroxine, adrenaline, epinephrine, norepinephrine, noradrenaline, sulforhodamine, malachite green, rhodamine, fluorescein, hoechst, cyanine, tetramethylrosamine, dfhbi, fluorophore, cy3, cy5, bisphenol, atrazine, acetamiprid, glyphosate, profenofos, carbendazim, diazinon, patulin, zearalenone, fumonisin, deoxynivalenol, citrinin, hematoporphyrin, porphyrin, biotin, digoxin, digoxigenin, ethanolamine, vitamin, folate, folic acid, arsen, mercury, cadmium, lead ion, potassium ion, tyrosinamide, l-arginine, l-dopa, melamine, urea, creatinine, acetylcholine, tartrazine, glutathione, glutamate, glutamic acid, cellobiose, cobinamide, moenomycin, tacrolimus, methylenedianiline, glucan, palladium, methamidophos, versicolorin, callose, sialyl lewis, aminoglycoside, lividomycin, ractopamine, reactive blue, sephadex, sodium ion, tetra-bde, lewis x, tyrosine, tryptophan, serine, citrulline, okadaic, nonylphenol, trichloro, ddt, adenine, sulfoxylate, coedine, “(pb)”
TISSUE_KW = tissue, biopsy, whole blood, saliva
Converted from classification_decision_tree.md. Classification logic lives in aptamer_benchmark/04_targets.py; the diagram renders offline via inlined mermaid.