Student-data migration plan: storage rules + in-place migration

Student-data migration: establish storage rules, then validate-and-formalize the existing migration in place. Internal review surface — not yet approved.
draft, for review · Area 1 · Beneficiary care · no data wipe

Context

The Foundation's student data lives in a coordinator's personal Google Drive archive ("Thông tin học sinh", i.e. "Student information"), outside the Foundation Shared Drive. It spans 3 school years, uploaded by year and then by school, and includes many consolidated files (a roster, KQ (results) sheet, or SE batch that covers a group of students rather than one).

Over 2026-06-09/10, prior sessions already loaded 142 children into the live care store via one-off scratch scripts. That work is real but ad-hoc: the raw source still sits outside the Foundation Drive, provenance back to source documents is partial, there are known unresolved gaps (phantom students, grade-label mismatches), and there is no documented repeatable process.

This plan does two things, in order:

  1. Establish the storage rules as a canonical reference.
  2. Validate-and-formalize the existing migration in place so the live data conforms — without wiping live data (the care store is the sole copy of beneficiary data, and a known write-race makes re-disturbing it risky).

Decisions locked by the administrator

What already exists (reused, not rebuilt)

Gaps vs. the 5 rules

Rule | Gap
1: two spaces (human Doc/Sheet/PDF + machine md/sqlite) | Architecture exists, but raw source docs are not in the Foundation Drive and have no machine md/txt mirror.
2: consolidated → per-student extraction + journal records entry+source; a folder for consolidated docs | Extraction happened and source_ref exists, but there is no consolidated-docs Drive folder and no catalog linking a journal entry to its consolidated source file.
3: student docs in student folder + bucket md + key details to sổ tay; reports full-review, early→latest | Per-student folders exist, but there is no per-student source-doc subfolder, and the handbook + child report order the journal newest-first (the reverse of early→latest).
4: school + cohort register | No school/cohort register exists; reports aggregate per-child each run.
5: migrate, reuse existing data | Done ad-hoc via scratch scripts; needs one documented idempotent tool + gap reconciliation.

Part 1: Establish the rules (reference document)

New rule file .claude/rules/care-data.md — the canonical, agent-facing reference for student-data storage. Codifies:

  1. Two spaces. Every input is persisted in both: the human-readable Foundation Shared Drive (as Google Doc/Sheet/Word/Excel/PDF, read-only to users) and the machine store (as .md/.txt/.sqlite).
  2. Consolidated (group) docs. Stored intact in 2_Hoc-sinh/Tai-lieu-tong-hop/ (human) + a .md/.txt extraction (machine), cataloged in a new source_doc table. Each per-student fact pulled from it is written to that student's sổ tay as a journal entry whose source_ref = the source-doc id/section. The journal records both the entry and its source.
  3. Student-specific docs. Stored in the student's folder 2_Hoc-sinh/Ho-so/<mã> — <Tên>/Tai-lieu-goc/ (human) + a machine md/txt mirror; key details → sổ tay. Reports do a full record review and present every time-series chronologically early→latest.
  4. School + cohort registers. A standing per-school and per-cohort record (a Google Doc users read + a lightweight register_entry log) that accretes high-level events (prizes, hardships, milestones, headcount/funding roll-ups) as documents arrive. Reports read the register and roll up per-student data.
  5. Migration discipline. Idempotent, dry-run-first, backups + write-race quiesce before any machine-store write; the live 142 are validated/formalized in place, never wiped.
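
Rule 3's early→latest presentation is a display concern, and it can be kept separate from how a report picks a "latest value". The sketch below shows one way to hold the two apart. It assumes each journal entry is a dict with an ISO-8601 `entry_date` string and an integer `id`; both field names and both function names are hypothetical, not the registry's actual columns or helpers.

```python
from typing import Any, Optional

# Hypothetical entry shape: {"id": int, "entry_date": "YYYY-MM-DD", ...}
Entry = dict[str, Any]


def display_order(entries: list[Entry]) -> list[Entry]:
    """Return entries oldest-first; ISO dates sort correctly as plain strings."""
    return sorted(entries, key=lambda e: (e["entry_date"], e["id"]))


def latest_value(entries: list[Entry], field: str) -> Optional[Any]:
    """Return the most recent non-empty value of `field`, or None if there is none."""
    filled = [e for e in entries if e.get(field) not in (None, "")]
    if not filled:
        return None
    return max(filled, key=lambda e: (e["entry_date"], e["id"]))[field]
```

Because `latest_value` never depends on list order, flipping the display to oldest-first does not change what a roll-up reads.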

Cross-references: a pointer in CLAUDE.md §Pointers and a short §16 "Storage & migration rules" in the beneficiary-care subplan (which stays the design-of-record).

Part 2: Canonical folder layout (two spaces)

Human — Foundation Shared Drive band 2_Hoc-sinh/

2_Hoc-sinh/
  Ho-so/<mã> — <Tên>/            # per-student folder (EXISTS)
    So-tay · <mã> · <Tên>        # the handbook Doc (EXISTS)
    Tai-lieu-goc/                # NEW: this student's source docs (rule 3)
  Tai-lieu-tong-hop/             # NEW: consolidated/group source docs (rule 2)
    <school-year>/<school>/…
  So-tay-truong/<school>         # NEW: per-school register Doc (rule 4)
  So-tay-lua/<programme>-<year>  # NEW: per-cohort register Doc (rule 4)
  Bao-cao/<YYYY>/…               # generated reports (EXISTS)
  Khao-sat/<ung-vien>/…          # candidate surveys PRE-admission only

File by ownership, with no method-named bands. Both OCR and survey material follow rules 2–3: a document is filed where it belongs, never in a folder named after how it was produced or collected. An OCR/transcription output is the human-readable rendering of a source document and is filed with the doc it transcribes (per-student → that student's Tai-lieu-goc/; group sheet → Tai-lieu-tong-hop/). The Tai-lieu-so-hoa-OCR/ band is retired, and its 41 Docs (all per-student) relocate into student folders. Survey reports and survey information are student-specific selection data: once a candidate is admitted, the survey Doc + key details move into that student's Tai-lieu-goc/ + sổ tay, and Khao-sat/ keeps only un-promoted candidates. The "AI transcription" fact lives on source_doc, not in a folder name.

Machine — internal care store (prefix unchanged)

so-dang-ky.sqlite                # registry (the queryable shape)
Ho-so/journey-NNNN.md            # canonical profile (EXISTS)
So-phuc-loi/journey-NNNN/…       # benefits ledger (EXISTS)
Tai-lieu-vao/<week>/<journey>/…  # weekly intake inbox (EXISTS)
Tai-lieu-tong-hop/               # NEW: .md/.txt extraction of each consolidated doc
Tai-lieu-goc/journey-NNNN/       # NEW: .md/.txt mirror of student-specific docs

Part 2b: Document types & sources (plan up front)

The ad-hoc loads added one document type at a time. To stop retrofitting, every source doc is typed (source_doc.doc_type) and classified by scope (per-student → rule 3 / consolidated → rule 2 / multi-scope → register, rule 4), with a routing target and a petal anchor known up front. Only the four highest-volume, recognisable types have structured extractors today (hoc-ba, so-diem, chuyen-can, hop-phu-huynh); we add a new one only where volume + shape justify it.

Sources (profile_entry.source, unchanged): school | coordinator | student | family | partner.

Per-student (rule 3 → student folder + sổ tay)

Document type | doc_type | Routing → target | Petal
School record book / contact slip | hoc-ba | extractor → school_report | 1
Exam results (e.g. Đại An) | ket-qua-thi | NEW extractor → school_report | 1
Essay / contest entry | bai-viet | prose → profile_entry | 4
Social-emotional assessment (SE) | se | NEW extractor → profile_entry (+school_report) | 3
Certificate of merit / award title | giay-khen | NEW: prose → profile_entry + register | varies
Survey / candidate file | khao-sat | survey → student folder on admission | 1
Home-visit report / case selection | tham-gia-dinh | prose → profile_entry | 1
Birth certificate / identity papers (sensitive) | giay-to | identity → child (DoB) |
Poor-household / circumstances certification (sensitive) | hoan-canh | background + register hardship | 1
Thank-you letter / family letter | thu | context-link | varies
Receipt / gift-handover photo | bien-nhan | → benefits ledger / benefit | 1
School-transfer / graduation certificate | chuyen-tot-nghiep | child status + register milestone |

Consolidated / group (rule 2 → Tai-lieu-tong-hop/ + per-student extraction)

Document type | doc_type | Routing → target | Petal
Class list / roster | roster | → child rows |
Gradebook / monthly score sheet | so-diem | extractor (multi-child) → school_report | 1
Ranking table / semester results | xep-hang | extractor → school_report | 1
Class attendance report | chuyen-can | extractor → school_report | 1
Parent-meeting report (multiple children) | hop-phu-huynh | extractor → school_report | 1/3
Monthly life-skills session minutes | sinh-hoat-ky-nang | prose → per-student entries | 2/3
Field trip / experiential activity / company visit | tham-quan | prose → per-student entries | 6/7
Sports / rope-skipping results | the-thao | prose → per-student entries | 5
Community-service report | cong-dong | prose → per-student entries | 8
Attendance / activity participation sheet | tham-gia | list → per-student entries | varies
School-level funding receipt (minutes of funds received) | bien-ban-tai-tro | school folder + register → fan out per-student benefit rows (funding_source links donor/finance) | 1

Multi-scope (rule 4 → register_entry): a school/cohort aggregate prize, an event recap, a hardship affecting a group, and the headcount/funding roll-ups — written to the school or cohort register.

Finance / donor docs — same pattern, different store. Bank statements, donation receipts, and contribution records follow the identical two-space + extract discipline but route into the donor/finance store (Areas 3 & 4: 3_Nha-tai-tro/ + 4_Tai-chinh/ + the finance registry), handled by existing finance/donor tooling, not the care store. bien-ban-tai-tro is the bridge: one school-level funding receipt creates per-student benefit rows on the care side and ties to a donor contribution on the finance side via funding_source. This plan owns only the care side of that link.

Build implication: extend the school-doc extractor with three new schemas (ket-qua-thi, se, giay-khen) + their intake routing tokens. Sensitive types (giay-to, hoan-canh, health) honour subplan §3: the source doc is stored, but identifying medical / precise-circumstance detail is not copied into the profile body — only the care-relevant fact, scoped.

Part 2c: Intake channels & routing (where documents come from, where they go)

Four ways a document arrives; all funnel into one classify → route → extract → file pipeline; anything unmatched goes to a triage queue, never silently filed (the flag-on-failure discipline already built).

  1. Hub upload (preferred). Staff upload a file and pick scope (student / school / cohort / candidate) + doc_type at upload time. Explicit metadata = deterministic routing. Written to both spaces + queued for extraction.
  2. Email. A care-documents mailbox/label (mirrors the existing bank-notification lane): attachments pulled → triage queue, doc_type/scope inferred (filename token + roster name-match) or set by a human.
  3. General drop folder. A catch-all care inbox swept by a watch job (mirrors ingest_watch): same triage.
  4. Form-attached uploads. When a Form allows file/photo upload, the onFormSubmit doorbell already knows the form's context (which candidate/student/event), moves the files into the matching folder + machine store tagged with that context, then queues extraction — no guessing.

Routing/association: explicit metadata (hub pick or form context) → deterministic; else a filename token (<journey-id|student_code>__<doc-type>); a consolidated doc fans out per student via the roster name-matcher. Unmatched / low-confidence / multi-child-needing-split → the triage queue (a hub screen), never silently filed.
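
The filename-token fallback can be illustrated with a small parser. This is a sketch under stated assumptions, not the intake code: the regular expression, the `Route` type, `route_by_filename`, and the `KNOWN_DOC_TYPES` subset are all hypothetical, and the token is assumed to sit at the start of the filename. Anything that does not parse cleanly is sent to triage rather than filed.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical subset of the Part-2b doc_type slugs, for illustration only.
KNOWN_DOC_TYPES = {"hoc-ba", "so-diem", "chuyen-can", "hop-phu-huynh", "se", "giay-khen"}

# Assumed token shape: "<journey-id|student_code>__<doc-type>" at the start of the name.
TOKEN_RE = re.compile(r"^(?P<key>[A-Za-z0-9-]+)__(?P<doc_type>[a-z0-9-]+)")


@dataclass
class Route:
    target: str                       # "student" or "triage"
    student_key: Optional[str] = None
    doc_type: Optional[str] = None
    reason: str = ""


def route_by_filename(filename: str) -> Route:
    """Route deterministically when the token parses; otherwise send to triage."""
    stem = filename.rsplit(".", 1)[0]
    match = TOKEN_RE.match(stem)
    if match is None:
        return Route("triage", reason="no filename token")
    if match["doc_type"] not in KNOWN_DOC_TYPES:
        return Route("triage", student_key=match["key"], reason="unknown doc_type")
    return Route("student", match["key"], match["doc_type"])
```

For example, `route_by_filename("journey-0042__hoc-ba.pdf")` yields a student route, while `route_by_filename("scan_001.pdf")` lands in triage with a reason attached.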

Part 3: Schema additions (additive, idempotent)

Extend the registry migration tool — following the existing IF NOT EXISTS / additive-column discipline.

source_doc — catalog of every source file

doc_id (PK, src-YYYY-NNNN) · title · kind (consolidated|student)
doc_type (Part-2b taxonomy: hoc-ba|se|giay-khen|…) · scope_type · scope_key
school_year · drive_url (human copy) · bucket_path (machine extraction)
original_name · transcription (1 = AI/OCR) · added_at · added_by

The journal's source_ref and the school-report's source field reference doc_id.
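
A minimal sketch of how source_doc could be added under the IF NOT EXISTS discipline, using Python's sqlite3 module. The column types, CHECK constraints, index name, and the `migrate` / `next_doc_id` helpers are assumptions; only the column names and the src-YYYY-NNNN id shape come from the field list above.

```python
import sqlite3

SOURCE_DOC_DDL = """
CREATE TABLE IF NOT EXISTS source_doc (
    doc_id        TEXT PRIMARY KEY,   -- src-YYYY-NNNN
    title         TEXT NOT NULL,
    kind          TEXT NOT NULL CHECK (kind IN ('consolidated', 'student')),
    doc_type      TEXT,               -- Part-2b taxonomy slug
    scope_type    TEXT,
    scope_key     TEXT,
    school_year   TEXT,
    drive_url     TEXT,               -- human copy
    bucket_path   TEXT,               -- machine extraction
    original_name TEXT,
    transcription INTEGER NOT NULL DEFAULT 0 CHECK (transcription IN (0, 1)),
    added_at      TEXT NOT NULL DEFAULT (datetime('now')),
    added_by      TEXT
);
CREATE INDEX IF NOT EXISTS idx_source_doc_scope ON source_doc (scope_type, scope_key);
"""


def migrate(conn: sqlite3.Connection) -> None:
    """Additive and safe to re-run: every statement is IF NOT EXISTS."""
    conn.executescript(SOURCE_DOC_DDL)


def next_doc_id(conn: sqlite3.Connection, year: int) -> str:
    """Allocate the next src-YYYY-NNNN id for a year (assumed allocation scheme)."""
    prefix = f"src-{year}-"
    row = conn.execute(
        "SELECT MAX(doc_id) FROM source_doc WHERE doc_id LIKE ?", (prefix + "%",)
    ).fetchone()
    last = int(row[0][len(prefix):]) if row[0] else 0
    return f"{prefix}{last + 1:04d}"
```

The MAX(doc_id) comparison relies on zero-padding, so this allocator only holds up to 9999 documents per year.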

register_entry — one lightweight log for both register kinds

reg_id (PK) · scope_type (school|cohort) · scope_key · entry_date
kind (prize|hardship|milestone|headcount|funding|note) · detail
journey_id (nullable) · source_doc_id (nullable) · by_person · created_at
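
The register log could be created and appended to in the same additive style. In this sketch the column types, the integer reg_id, and the `add_entry` helper are assumptions. Idempotence is approximated by checking for an identical (scope, date, kind, detail) row before inserting, which is only safe while other writers are quiesced.

```python
import sqlite3
from typing import Optional

REGISTER_ENTRY_DDL = """
CREATE TABLE IF NOT EXISTS register_entry (
    reg_id        INTEGER PRIMARY KEY,
    scope_type    TEXT NOT NULL CHECK (scope_type IN ('school', 'cohort')),
    scope_key     TEXT NOT NULL,
    entry_date    TEXT NOT NULL,
    kind          TEXT NOT NULL CHECK (kind IN
                    ('prize', 'hardship', 'milestone', 'headcount', 'funding', 'note')),
    detail        TEXT NOT NULL,
    journey_id    TEXT,
    source_doc_id TEXT,
    by_person     TEXT,
    created_at    TEXT NOT NULL DEFAULT (datetime('now'))
);
"""


def add_entry(conn: sqlite3.Connection, scope_type: str, scope_key: str,
              entry_date: str, kind: str, detail: str,
              journey_id: Optional[str] = None, source_doc_id: Optional[str] = None,
              by_person: Optional[str] = None) -> bool:
    """Insert one register row; return False if an identical row already exists."""
    key = (scope_type, scope_key, entry_date, kind, detail)
    exists = conn.execute(
        "SELECT 1 FROM register_entry WHERE scope_type = ? AND scope_key = ? "
        "AND entry_date = ? AND kind = ? AND detail = ?", key
    ).fetchone()
    if exists:
        return False
    conn.execute(
        "INSERT INTO register_entry (scope_type, scope_key, entry_date, kind, detail, "
        "journey_id, source_doc_id, by_person) VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        key + (journey_id, source_doc_id, by_person),
    )
    return True
```

A plain INSERT is used on purpose: an invalid scope_type or kind then raises sqlite3.IntegrityError instead of being skipped silently.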

Part 4: Code changes

  1. Schema — add source_doc + register_entry (tables, index, helper fns); extend the selftest.
  2. Chronological early→latest (rule 3) — flip the per-student journal + transcript display to oldest-first, while leaving the "latest value" selectors used by the school/cohort reports unchanged. Touches the handbook compiler, the child-report renderer, and a shared ordering helper. The child report's default window also becomes full programme history (not a trailing year) so it is the "full record review" rule 3 asks for.
  3. Report tools & flows (rules 3–4). Beyond ordering: the child report surfaces each entry's source_ref → source_doc link (traceable to its source); the school + cohort reports read the standing register_entry log (prizes/hardships/milestones) and roll up per-student data; and the two register compilers get the same delivery path as reports: a one-click hub trigger + scheduled run, tracked in report_index, upserted as native Docs.
  4. Registers (rule 4) — deterministic per-school and per-cohort register Docs (headcount by status/year, lifetime funding, roster, and the register_entry event timeline), upserted to So-tay-truong/ and So-tay-lua/ via the existing Drive seam.
  5. Migration tool: tools/care_migrate.py (NEW; supersedes the scratch one-offs). One documented idempotent CLI with --dry-run / --selftest and four subcommands: copy-source (copy the personal-Drive archive into the Foundation Drive + catalog it), backfill-provenance (link the loaded 142 to their source docs), reconcile (apply held-back fixes from a coordinator decisions manifest), and build-registers (seed register_entry + emit register Docs).
  6. Tests — extend the matching tool tests for new tables, chronological order, register compile; add a migration-tool test (fake Drive + temp store).
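
One possible shape for the migration CLI in item 5, sketched with argparse. The subcommand names and the --dry-run / --selftest flags come from the plan; the parser layout, the help texts, and the placeholder `main` are assumptions, and no real handlers are implemented here.

```python
import argparse
from typing import Optional, Sequence

# Subcommand names are from the plan; help texts paraphrase it.
SUBCOMMANDS = {
    "copy-source": "copy the personal-Drive archive into the Foundation Drive and catalog it",
    "backfill-provenance": "link the loaded children to their source docs",
    "reconcile": "apply held-back fixes from the coordinator decisions manifest",
    "build-registers": "seed register_entry and emit register Docs",
}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="care_migrate")
    parser.add_argument("--selftest", action="store_true", help="run built-in checks and exit")
    # --dry-run lives on a parent parser so `copy-source --dry-run` is accepted.
    common = argparse.ArgumentParser(add_help=False)
    common.add_argument("--dry-run", action="store_true", help="report planned writes only")
    sub = parser.add_subparsers(dest="command")
    for name, help_text in SUBCOMMANDS.items():
        sub.add_parser(name, parents=[common], help=help_text)
    return parser


def main(argv: Optional[Sequence[str]] = None) -> str:
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.selftest:
        return "selftest"
    if args.command is None:
        parser.error("a subcommand is required")
    mode = "dry-run" if args.dry_run else "execute"
    # Placeholder: a real tool would dispatch to the subcommand's handler here.
    return f"{args.command}:{mode}"
```

With this layout, `main(["copy-source", "--dry-run"])` resolves to the dry-run path, and every subcommand shares the same flag without repeating its definition.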

Part 4b: Report catalogue (output specification)

From the administrator's output notes. The current tools cover part of this; the new cuts are the school-year report, the all-programme Tổng hợp report, the grade-level (khối 6–12) breakdown, and window as a first-class parameter. Note lứa (cohort = programme-entry year) ≠ năm học (academic year) — both are reportable. Window: tu-dau (lifetime) · nam (trailing 12 mo) · nam-hoc (one academic year) · latest.

Report | Tool | Window | Breakdown | Content / audience
Media post | Area-2 pipeline | | per event/student | dignified post copy · public (via /approve)
Per-school report | report_school | lifetime | one school | student headcount · funding amount · time in programme · general remarks · principal/trustee
Per-cohort report | report_cohort | lifetime | school × khối 6–12 | + headcount/funding/time/remarks · trustee
Per-academic-year report | report_cohort (+nam-hoc, NEW) | one academic year | school × khối | as cohort · trustee
Overall (Tổng hợp) report | NEW report_overall | lifetime; 1 year | all schools × all programmes | as above, every programme · trustee/board
Per-student report | report_child | lifetime; 1 year | one student | full record review · sponsor
Per-student report (talking points) | report_child (latest) | latest | one student | what's new + "Gợi ý đồng hành" · coordinator (CTV)

Build deltas: --window first-class on child/cohort (adds nam-hoc, latest); khối grouping in cohort/năm-học/overall (derive current khối from latest học bạ, already done in the banner); a thin Tổng hợp report across all programmes; the student talking-points variant = report_child latest + the already-built coordinator guidance block. "Bài viết cho media" stays the Area-2 output, listed for completeness.
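
To make window a first-class parameter, the four values could be resolved by one selector. This is a sketch under assumptions: entries are dicts with a `entry_date` of type `datetime.date`, the academic-year bounds are passed in rather than derived, and "latest" is interpreted here as "entries on the most recent date", which the plan does not define. The function names are hypothetical.

```python
from datetime import date
from typing import Any, Optional

Entry = dict[str, Any]  # hypothetical shape: {"entry_date": date, ...}


def year_before(d: date) -> date:
    """Same calendar day one year earlier; 29 Feb falls back to 28 Feb."""
    try:
        return d.replace(year=d.year - 1)
    except ValueError:
        return d.replace(year=d.year - 1, day=28)


def select_window(entries: list[Entry], window: str, today: date,
                  academic_year: Optional[tuple[date, date]] = None) -> list[Entry]:
    """Filter entries by window, then return them oldest-first."""
    if window == "tu-dau":            # lifetime
        kept = list(entries)
    elif window == "nam":             # trailing 12 months
        start = year_before(today)
        kept = [e for e in entries if start < e["entry_date"] <= today]
    elif window == "nam-hoc":         # one academic year, bounds supplied by the caller
        if academic_year is None:
            raise ValueError("nam-hoc needs (start, end) dates")
        start, end = academic_year
        kept = [e for e in entries if start <= e["entry_date"] <= end]
    elif window == "latest":          # assumed meaning: entries on the newest date
        newest = max((e["entry_date"] for e in entries), default=None)
        kept = [e for e in entries if e["entry_date"] == newest]
    else:
        raise ValueError(f"unknown window: {window}")
    return sorted(kept, key=lambda e: e["entry_date"])
```

Passing the academic-year bounds explicitly avoids hard-coding a school calendar that this plan does not specify.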

Part 5: Migration execution (supervised, in place)

Per the known write-race discipline: quiesce before any machine-store write, back up first, verify after.

  1. Back up the registry + a Drive snapshot of 2_Hoc-sinh/.
  2. Schema migrate — run against a copy; quiesce; upload; verify the two new tables hold.
  3. copy-source --dry-run → eyeball consolidated-vs-student classification → execute; populate source_doc.
  4. backfill-provenance over the 142.
  5. reconcile against the coordinator decisions manifest (phantom-student call is the coordinator's).
  6. build-registers → register_entry + School/Cohort Docs.
  7. Regenerate all sổ tay Docs + reports with chronological order.
  8. Validate — profile validator, consistency check, every changed tool's selftest, the test suite.
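
Steps 1–2 (back up, migrate against a copy, verify the new tables) can be sketched with the standard sqlite3 backup API. This assumes the registry is available as a local SQLite file during the quiesce window; the function names are hypothetical, and the Drive snapshot and the upload step are out of scope for the sketch.

```python
import sqlite3
from pathlib import Path

REQUIRED_TABLES = ("source_doc", "register_entry")


def backup_registry(live: Path, backup: Path) -> None:
    """Copy the registry with SQLite's online backup, giving a consistent snapshot."""
    src = sqlite3.connect(live)
    dst = sqlite3.connect(backup)
    try:
        src.backup(dst)
    finally:
        dst.close()
        src.close()


def missing_tables(db: Path, required: tuple[str, ...] = REQUIRED_TABLES) -> list[str]:
    """Return the required tables that are absent; an empty list means verified."""
    conn = sqlite3.connect(db)
    try:
        rows = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
        present = {name for (name,) in rows}
    finally:
        conn.close()
    return [t for t in required if t not in present]
```

Running the schema migration against the backup copy first, then checking that `missing_tables` returns an empty list, gives a cheap gate before anything touches the live store.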

Decisions / out of scope