diff --git a/context/hash-links-implementation-plan.md b/context/hash-links-implementation-plan.md new file mode 100644 index 0000000..2d2ad7b --- /dev/null +++ b/context/hash-links-implementation-plan.md @@ -0,0 +1,589 @@ +# Hash Links Implementation Plan + +**Issue:** #93 - Hash links are simply skipped and are no links at all during rendering +**Date:** 2025-11-27 +**Status:** Proposal - Awaiting Team Decision + +--- + +## Executive Summary + +This document presents multiple implementation approaches for fixing hash links in the Notion-to-Docusaurus pipeline. Each approach has trade-offs in complexity, maintainability, and integration with the existing architecture. + +**Recommended Approach:** Hybrid Strategy (Approach D) +- Combines enhanced content sanitization with a Docusaurus remark plugin +- Follows existing architectural patterns (`remark-fix-image-paths`) +- Provides clean separation of concerns +- Most maintainable long-term solution + +--- + +## Architecture Overview + +### Current Processing Pipeline + +``` +┌─────────────────────────────────────────────────────────────┐ +│ Notion API │ +│ • Blocks with link_to_page references │ +│ • Page mentions (@Page Name) │ +│ • Block IDs for anchors │ +└─────────────────────────────────────────────────────────────┘ + ↓ +┌─────────────────────────────────────────────────────────────┐ +│ notion-to-md v3.1.9 │ +│ • pageToMarkdown() - converts blocks │ +│ • toMarkdownString() - generates markdown │ +│ • OUTPUT: Malformed tags like │ +└─────────────────────────────────────────────────────────────┘ + ↓ +┌─────────────────────────────────────────────────────────────┐ +│ contentSanitizer.ts │ +│ • Fixes malformed HTML/JSX tags │ +│ • PROBLEM: Discards actual link information │ +│ • Converts to placeholder: [link](#) │ +└─────────────────────────────────────────────────────────────┘ + ↓ +┌─────────────────────────────────────────────────────────────┐ +│ Docusaurus Build │ +│ • Processes markdown files │ +│ • Generates 
routes from frontmatter slugs │ +│ • RESULT: Broken or missing hash links │ +└─────────────────────────────────────────────────────────────┘ +``` + +--- + +## Approach A: Custom Post-Processing (Original Plan) + +### Description + +Add link rewriting during Notion fetch in `generateBlocks.ts`. + +### Architecture + +``` +┌─────────────────────────────────────────────────────────────┐ +│ Phase 1: Build Mappings │ +│ • Notion Page ID → Local Slug │ +│ • Notion Block ID → Readable Anchor Name │ +└─────────────────────────────────────────────────────────────┘ + ↓ +┌─────────────────────────────────────────────────────────────┐ +│ Phase 2: Three-Pass Processing │ +│ Pass 1: Collect all page ID → slug mappings │ +│ Pass 2: Process all blocks → build anchor mappings │ +│ Pass 3: Rewrite links with complete mapping data │ +└─────────────────────────────────────────────────────────────┘ + ↓ +┌─────────────────────────────────────────────────────────────┐ +│ Phase 3: Link Rewriting │ +│ • Parse markdown links │ +│ • Extract Notion URLs and block IDs │ +│ • Convert to local paths: /docs/slug#anchor │ +└─────────────────────────────────────────────────────────────┘ +``` + +### Implementation Files + +**New Files:** +``` +scripts/notion-fetch/ +├── linkMapper.ts # Mapping system (page & block IDs) +├── linkRewriter.ts # Notion URL → local path conversion +├── blockAnchorBuilder.ts # Extract blocks & generate anchors +└── anchorSlugifier.ts # Slugify text to match Docusaurus +``` + +**Modified Files:** +``` +scripts/notion-fetch/ +└── generateBlocks.ts # Integrate link processing +``` + +### Pros +- ✅ Full control over link conversion +- ✅ Can handle all edge cases +- ✅ Works during Notion fetch (offline afterward) +- ✅ No dependency on Docusaurus build process + +### Cons +- ❌ Complex three-pass processing required +- ❌ Custom code outside standard Docusaurus patterns +- ❌ Harder to debug and maintain +- ❌ Processes ALL pages even if Docusaurus doesn't need them +- ❌ Manual 
cache management required +- ❌ Doesn't leverage Docusaurus link resolution + +### Estimated Effort +- **Development:** 11-16 hours +- **Testing:** 3-4 hours +- **Total:** 14-20 hours + +--- + +## Approach B: Docusaurus Remark Plugin + +### Description + +Create a remark plugin that runs during Docusaurus build to transform links. + +### Architecture + +``` +┌─────────────────────────────────────────────────────────────┐ +│ Notion Fetch (generateBlocks.ts) │ +│ • Generate pages with enhanced metadata │ +│ • Export page ID mappings to JSON file │ +│ • Preserve Notion URLs in markdown temporarily │ +└─────────────────────────────────────────────────────────────┘ + ↓ +┌─────────────────────────────────────────────────────────────┐ +│ Docusaurus Build │ +│ • Reads markdown files │ +│ • Applies remark plugins in order │ +└─────────────────────────────────────────────────────────────┘ + ↓ +┌─────────────────────────────────────────────────────────────┐ +│ remark-notion-links.ts (NEW) │ +│ • Traverses markdown AST │ +│ • Finds link nodes with Notion URLs │ +│ • Loads page ID mappings from JSON │ +│ • Converts URLs to local paths with anchors │ +│ • Leverages Docusaurus heading data │ +└─────────────────────────────────────────────────────────────┘ + ↓ +┌─────────────────────────────────────────────────────────────┐ +│ Final Markdown │ +│ • All links converted to local paths │ +│ • Hash anchors match Docusaurus heading IDs │ +└─────────────────────────────────────────────────────────────┘ +``` + +### Implementation Files + +**New Files:** +``` +scripts/ +├── remark-notion-links.ts # Remark plugin for link transformation +└── notion-link-mappings.json # Generated page/block ID mappings +``` + +**Modified Files:** +``` +docusaurus.config.ts # Add new remark plugin +scripts/notion-fetch/ +├── generateBlocks.ts # Export link mappings to JSON +└── contentSanitizer.ts # Preserve Notion URLs (don't discard) +``` + +### Pros +- ✅ Follows existing architectural pattern 
(`remark-fix-image-paths`) +- ✅ Integrates cleanly with Docusaurus +- ✅ Can leverage Docusaurus heading TOC data +- ✅ Only processes pages Docusaurus needs +- ✅ Uses Docusaurus caching automatically +- ✅ Easier to debug (part of standard build) +- ✅ More maintainable long-term + +### Cons +- ❌ Requires mapping data to be exported/loaded +- ❌ Two-stage process (Notion fetch + Docusaurus build) +- ❌ Slightly more complex setup +- ❌ Need to handle stale mapping data + +### Estimated Effort +- **Development:** 8-10 hours +- **Testing:** 2-3 hours +- **Total:** 10-13 hours + +--- + +## Approach C: Upstream Fix (Wait for notion-to-md) + +### Description + +Contribute to notion-to-md Issue #161 or wait for upstream fix. + +### Architecture + +``` +┌─────────────────────────────────────────────────────────────┐ +│ Contribute to notion-to-md │ +│ • Fork notion-to-md repository │ +│ • Implement hash link support │ +│ • Submit PR to upstream │ +│ • Wait for merge and release │ +└─────────────────────────────────────────────────────────────┘ + ↓ +┌─────────────────────────────────────────────────────────────┐ +│ Upgrade notion-to-md │ +│ • Update package.json to new version │ +│ • Test with existing content │ +│ • Remove temporary workarounds │ +└─────────────────────────────────────────────────────────────┘ +``` + +### Pros +- ✅ Benefits entire community +- ✅ Proper long-term solution +- ✅ Reduces custom code in this project +- ✅ Maintained by upstream + +### Cons +- ❌ Uncertain timeline (could be months) +- ❌ May not match exact requirements +- ❌ Need temporary workaround anyway +- ❌ Dependency on external maintainers +- ❌ High priority requirement (can't wait) + +### Estimated Effort +- **Upstream contribution:** 20-30 hours +- **Integration:** 2-4 hours +- **Timeline:** 2-6 months (uncertain) + +--- + +## Approach D: Hybrid Strategy (RECOMMENDED) + +### Description + +Combine enhanced content sanitization with a remark plugin for best of both worlds. 
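The hybrid idea can be illustrated with a small sketch. It assumes the temporary `notion://page-id#block-id` link format and a mapping object with `pages` and `blocks` tables, as proposed in this plan; `parseNotionUrl` and `resolveNotionUrl` are hypothetical helper names chosen for illustration, not existing project code, and same-page links without a page ID are deliberately left unhandled here.

```typescript
// Hypothetical shape of the exported mapping data (an assumption of this sketch).
interface LinkMappings {
  pages: Record<string, string>;  // Notion page ID -> local slug
  blocks: Record<string, string>; // Notion block ID -> readable anchor
}

// Split a temporary notion:// URL into a page ID and an optional block ID.
// Returns null for anything that is not a notion:// URL or has no page ID.
function parseNotionUrl(url: string): { pageId: string; blockId?: string } | null {
  const prefix = "notion://";
  if (!url.startsWith(prefix)) return null;
  const rest = url.slice(prefix.length);
  const hashIndex = rest.indexOf("#");
  const pageId = hashIndex === -1 ? rest : rest.slice(0, hashIndex);
  const blockId = hashIndex === -1 ? "" : rest.slice(hashIndex + 1);
  if (!pageId) return null;
  return blockId ? { pageId, blockId } : { pageId };
}

// Resolve a notion:// URL to a local docs path.
// Returns null when the URL is not a notion:// URL or the page is unknown;
// an unknown block ID falls back to the page path without a hash.
function resolveNotionUrl(url: string, mappings: LinkMappings): string | null {
  const parsed = parseNotionUrl(url);
  if (!parsed) return null;
  const slug = mappings.pages[parsed.pageId];
  if (slug === undefined) return null;
  const anchor = parsed.blockId ? mappings.blocks[parsed.blockId] : undefined;
  return `/docs/${slug}${anchor ? "#" + anchor : ""}`;
}
```

A remark plugin could call a helper like `resolveNotionUrl` for each link node and warn whenever it returns `null`; whether an unknown block should fall back to the page path or fail the build depends on the error handling decision still open in this plan.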
+ +### Architecture + +``` +┌─────────────────────────────────────────────────────────────┐ +│ Phase 1: Enhanced Content Sanitizer │ +│ • Extract page/block IDs from malformed tags │ +│ • Convert to markdown with data-notion-* attributes │ +│ • Example: [link](notion://page-id#block-id) │ +│ • Preserve link information for later processing │ +└─────────────────────────────────────────────────────────────┘ + ↓ +┌─────────────────────────────────────────────────────────────┐ +│ Phase 2: Export Mappings │ +│ • Generate page ID → slug mappings │ +│ • Generate block ID → anchor mappings │ +│ • Export to scripts/notion-link-mappings.json │ +└─────────────────────────────────────────────────────────────┘ + ↓ +┌─────────────────────────────────────────────────────────────┐ +│ Phase 3: Remark Plugin (Docusaurus Build) │ +│ • Load mappings from JSON │ +│ • Transform notion:// URLs to local paths │ +│ • Generate readable anchors from block content │ +│ • Validate links and report broken references │ +└─────────────────────────────────────────────────────────────┘ +``` + +### Implementation Files + +**New Files:** +``` +scripts/ +├── remark-notion-links.ts # Remark plugin +├── notion-fetch/ +│ ├── linkMappingExporter.ts # Export mappings to JSON +│ └── anchorSlugifier.ts # Slugify headings +└── notion-link-mappings.json # Generated mappings +``` + +**Modified Files:** +``` +scripts/notion-fetch/ +├── contentSanitizer.ts # Enhanced link extraction +├── generateBlocks.ts # Export mappings +└── contentSanitizer.test.ts # Updated tests + +docusaurus.config.ts # Add remark plugin +``` + +### Implementation Phases + +#### **Phase 1: Enhanced Sanitizer (2-3 hours)** +```typescript +// contentSanitizer.ts +function extractNotionLink(malformedTag: string): { + pageId?: string; + blockId?: string; + text: string; +} { + // Extract IDs from malformed tags + // Return structured data +} + +// Convert malformed tags to temporary format +content = content.replace( + /]+)>/gi, + (match, 
linkInfo) => { + const { pageId, blockId, text } = extractNotionLink(linkInfo); + return `[${text}](notion://${pageId}${blockId ? '#' + blockId : ''})`; + } +); +``` + +#### **Phase 2: Mapping Exporter (2-3 hours)** +```typescript +// linkMappingExporter.ts +interface LinkMappings { + pages: Record<string, string>; // pageId → slug + blocks: Record<string, string>; // blockId → anchor + version: string; // Cache version + generated: string; // Timestamp +} + +export function exportLinkMappings( + pages: PageData[], + outputPath: string +): void { + // Build mappings from processed pages + // Write to JSON file +} +``` + +#### **Phase 3: Remark Plugin (4-5 hours)** +```typescript +// remark-notion-links.ts +import { visit } from 'unist-util-visit'; +import type { Plugin } from 'unified'; + +const remarkNotionLinks: Plugin = () => { + const mappings = loadMappings(); + + return (tree) => { + visit(tree, 'link', (node) => { + if (node.url.startsWith('notion://')) { + const { pageId, blockId } = parseNotionUrl(node.url); + const slug = mappings.pages[pageId]; + const anchor = blockId ? mappings.blocks[blockId] : ''; + + if (slug) { + node.url = `/docs/${slug}${anchor ?
'#' + anchor : ''}`; + } else { + // Warn about broken link + console.warn(`Unknown page ID: ${pageId}`); + } + } + }); + }; +}; + +export default remarkNotionLinks; +``` + +### Pros +- ✅ Clean separation of concerns +- ✅ Follows existing architectural patterns +- ✅ Preserves link information through pipeline +- ✅ Easier to debug (clear data flow) +- ✅ Can add validation and error reporting +- ✅ Extensible for future enhancements +- ✅ Leverages Docusaurus ecosystem +- ✅ Incremental implementation possible + +### Cons +- ❌ Slightly more complex than single approach +- ❌ Need to maintain mapping file format +- ❌ Two-stage processing + +### Estimated Effort +- **Phase 1:** 2-3 hours +- **Phase 2:** 2-3 hours +- **Phase 3:** 4-5 hours +- **Testing:** 2-3 hours +- **Total:** 10-14 hours + +--- + +## Comparison Matrix + +| Criteria | Approach A | Approach B | Approach C | Approach D | +|----------|-----------|-----------|-----------|-----------| +| **Complexity** | High | Medium | Low (wait) | Medium | +| **Maintainability** | Medium | High | High | High | +| **Integration** | Custom | Standard | Standard | Standard | +| **Timeline** | 2-3 weeks | 1-2 weeks | 2-6 months | 1-2 weeks | +| **Flexibility** | High | Medium | Low | High | +| **Debug Ease** | Medium | High | N/A | High | +| **Performance** | Medium | High | High | High | +| **Long-term Cost** | High | Medium | Low | Medium | +| **Risk Level** | Medium | Low | High | Low | + +--- + +## Recommendation: Approach D (Hybrid Strategy) + +### Why This Approach? + +1. **Follows Existing Patterns** + - Already using remark plugins (`remark-fix-image-paths`) + - Team familiar with this architecture + - Standard Docusaurus approach + +2. **Clean Architecture** + - Separation of concerns (sanitize → map → transform) + - Clear data flow through pipeline + - Easy to understand and maintain + +3. 
**Extensible** + - Can add link validation + - Can add broken link reporting + - Can add link analytics + - Can add custom link transformations + +4. **Reasonable Effort** + - 10-14 hours total (vs 14-20 for Approach A) + - Incremental implementation possible + - Can deliver MVP faster + +5. **Low Risk** + - Follows proven patterns + - Easy to debug + - Easy to roll back if needed + +### Future Path + +After implementing Approach D, consider: +- **Contributing to notion-to-md** (Approach C) for long-term upstream fix +- **Removing workaround** when upstream support is available +- **Packaging remark plugin** as standalone package for community + +--- + +## Implementation Roadmap + +### Week 1: Investigation & Phase 1 +- [ ] **Day 1-2:** Investigation (verify notion-to-md output formats) +- [ ] **Day 3:** Enhanced content sanitizer +- [ ] **Day 4:** Unit tests for sanitizer +- [ ] **Day 5:** Code review and refinement + +### Week 2: Phase 2 & 3 +- [ ] **Day 1-2:** Mapping exporter implementation +- [ ] **Day 3-4:** Remark plugin implementation +- [ ] **Day 5:** Integration and testing + +### Week 3: Validation & Deployment +- [ ] **Day 1-2:** Test with real Notion content +- [ ] **Day 3:** Performance testing and optimization +- [ ] **Day 4:** Documentation and team training +- [ ] **Day 5:** Deploy to preview environment + +### Week 4: Monitoring & Refinement +- [ ] Monitor for edge cases +- [ ] Fix any issues discovered +- [ ] Gather team feedback +- [ ] Plan enhancements + +--- + +## Risk Mitigation + +### Risk 1: Stale Mapping Data +**Mitigation:** +- Add version tracking to mapping file +- Regenerate on any Notion fetch +- Add validation checks in remark plugin + +### Risk 2: Unknown Notion URL Formats +**Mitigation:** +- Comprehensive investigation phase first +- Robust parsing with fallbacks +- Clear error messages for unsupported formats + +### Risk 3: Performance Impact +**Mitigation:** +- Mapping lookups are O(1) per link once the JSON file is loaded +- Docusaurus caching handles rebuild
optimization +- Monitor build times before/after + +### Risk 4: I18n Edge Cases +**Mitigation:** +- Research Docusaurus i18n + hash behavior first +- Test with multi-language pages +- Document i18n-specific behavior + +--- + +## Testing Strategy + +### Unit Tests +- [ ] Content sanitizer link extraction +- [ ] Mapping exporter output format +- [ ] Anchor slugification matches Docusaurus +- [ ] Remark plugin link transformation +- [ ] Edge cases (malformed URLs, missing mappings) + +### Integration Tests +- [ ] Full pipeline (Notion → Sanitizer → Mappings → Remark → Docusaurus) +- [ ] Cross-page links work correctly +- [ ] Same-page hash links work correctly +- [ ] External links remain unchanged +- [ ] Multi-language pages + +### Manual Tests +- [ ] Create test Notion pages with all link types +- [ ] Run full build pipeline +- [ ] Verify links work in dev server +- [ ] Test on preview deployment +- [ ] Check browser console for errors + +--- + +## Success Metrics + +### Functional +- [ ] 100% of same-page hash links work +- [ ] 100% of cross-page links work +- [ ] 100% of cross-page + hash links work +- [ ] 0 regressions in existing links +- [ ] External links unchanged + +### Non-Functional +- [ ] Build time impact < 10% +- [ ] No memory issues +- [ ] Clear error messages for broken links +- [ ] Documentation complete +- [ ] Team training delivered + +--- + +## Open Issues for Team Discussion + +### 1. Error Handling Strategy +**Question:** What should happen when a link references a non-existent page or block? + +**Options:** +- A. Fail the build (strict) +- B. Warn and keep original URL (graceful) +- C. Convert to broken link marker (explicit) + +**Recommendation:** Option A for development, Option B for production + +### 2. Caching Strategy +**Question:** Should link mappings be committed to git or generated on each build? + +**Options:** +- A. Generate every time (slower but always fresh) +- B. Commit to git (faster but can be stale) +- C. 
Hybrid: cache with validation + +**Recommendation:** Option A (regenerate on Notion fetch) + +### 3. I18n Behavior +**Question:** How should hash links work across languages? + +**Scenarios:** +- Portuguese page links to English heading +- Should it go to Portuguese version? +- Should anchors be translated? + +**Action Required:** Research and team decision + +--- + +**Document Version:** 1.0 +**Last Updated:** 2025-11-27 +**Status:** Proposal - Awaiting Team Review +**Next Steps:** Team review meeting to discuss approach selection diff --git a/context/hash-links-specification.md b/context/hash-links-specification.md new file mode 100644 index 0000000..22a3c28 --- /dev/null +++ b/context/hash-links-specification.md @@ -0,0 +1,266 @@ +# Hash Links Specification + +**Issue:** #93 - Hash links are simply skipped and are no links at all during rendering +**Priority:** High +**Status:** Investigation Complete - Awaiting Implementation Decision +**Date:** 2025-11-27 + +--- + +## Problem Statement + +Notion allows users to create links to specific blocks (headings, paragraphs, etc.) within a page. These links include a block ID as a hash fragment in the URL (e.g., `https://notion.so/page-id#block-abc123`). Currently, these hash links are either stripped during conversion or converted to placeholder links, resulting in broken navigation in the generated documentation. + +### User Requirements + +1. **Readable anchor names**: Hash links should use human-readable names like `#about-this-guide`, not block IDs like `#block-abc123` +2. **Cross-page support**: Links should work for both same-page references and cross-page references +3. **Same-page support**: Links to headings within the same page should work correctly + +### Expected Behavior + +**Before (Current):** +```markdown +Check the [installation guide](https://notion.so/Installation-Guide-abc123#block-def456) for details. 
+``` +→ Broken link or points to Notion.so + +**After (Desired):** +```markdown +Check the [installation guide](/docs/installation-guide#prerequisites) for details. +``` +→ Works locally, navigates to correct section + +--- + +## Research Findings + +### 1. Current Link Processing Pipeline + +``` +Notion API (raw blocks with link_to_page and mentions) + ↓ +notion-to-md v3.1.9: pageToMarkdown() + ↓ +notion-to-md: toMarkdownString() + → Outputs MALFORMED HTML/JSX tags: , + ↓ +contentSanitizer.ts: sanitizeMarkdownContent() + → Converts to placeholder links: [link](#) + ↓ +Docusaurus build + → Renders as links, but destinations are broken +``` + +### 2. Key Discovery: Malformed Tag Output + +The `notion-to-md` library (v3.1.9) does **not** output standard markdown links for Notion page references. Instead, it outputs malformed HTML/JSX-like tags: + +**Examples from current sanitizer:** +- `` → Currently converted to `[link to section](#section)` +- `` → Currently converted to `[link](#)` +- `` → Currently converted to `[Link](#)` + +**Current sanitizer behavior (contentSanitizer.ts:107-117):** +```typescript +// Discards actual link information! +content = content.replace( + //gi, + "[link to section](#section)" +); + +content = content.replace(/]*[^\w\s"=-][^>]*>/g, "[link](#)"); +content = content.replace(/]*[^\w\s"=-][^>]*>/g, "[Link](#)"); +``` + +### 3. Cross-Page Linking Status + +**Current State:** +- ✅ Page titles are converted to kebab-case slugs (e.g., "Getting Started" → `getting-started`) +- ✅ Frontmatter includes `slug: /getting-started` for Docusaurus routing +- ✅ Multi-language pages share the same slug across languages +- ✅ Page metadata cache tracks Notion page IDs → output file paths + +**Missing:** +- ❌ No link rewriting system to convert Notion page URLs to local doc URLs +- ❌ No hash link support for block ID anchors +- ❌ No block ID → readable anchor mapping + +### 4. 
Upstream Library Status + +**notion-to-md v3.1.9:** +- GitHub Issue #161: "Add support for 'Link to block' to URL hash conversion" +- Status: Open (as of 2025-07-21) +- Current behavior: Strips out hash/block ID information during conversion +- No timeline for fix + +**Reference:** https://github.com/souvikinator/notion-to-md/issues/161 + +--- + +## Technical Requirements + +### 1. Link Types to Support + +| Link Type | Example | Current Behavior | Desired Behavior | +|-----------|---------|------------------|------------------| +| Same-page hash | `#block-id` | Stripped | `#section-name` | +| Cross-page | `https://notion.so/page-id` | Notion URL | `/docs/page-slug` | +| Cross-page + hash | `https://notion.so/page-id#block-id` | Notion URL or broken | `/docs/page-slug#section` | +| External | `https://example.com` | Works ✅ | Keep unchanged ✅ | + +### 2. Anchor Name Generation + +**Requirements:** +- Must match Docusaurus's heading anchor generation algorithm +- Must be human-readable (slugified from heading text) +- Must handle duplicates the same way Docusaurus does (`github-slugger`, which Docusaurus uses for heading IDs, appends `-1`, `-2`, etc.) +- Must work across all languages (i18n) + +**Approximate Anchor Algorithm (Docusaurus generates heading IDs with `github-slugger`, so verify this sketch against it):** +```typescript +function slugify(text: string): string { + return text + .toLowerCase() + .trim() + .replace(/\s+/g, '-') // spaces to hyphens + .replace(/[^\w\-]+/g, '') // remove special chars (\w is ASCII-only, so accented letters are dropped) + .replace(/\-\-+/g, '-') // collapse multiple hyphens + .replace(/^-+/, '') // trim leading hyphens + .replace(/-+$/, ''); // trim trailing hyphens +} +``` + +### 3. Notion URL Formats + +Must handle multiple Notion URL variants: +``` +https://notion.so/page-id +https://notion.so/Page-Title-page-id +https://notion.so/workspace/page-id +https://notion.so/page-id?p=page-id#block-id +``` + +### 4. I18n Considerations + +**Multi-language behavior:** +- Pages share slugs across languages: `/docs/page` (English) and `/pt/docs/page` (Portuguese) +- Question: Should hash links be language-aware?
+- Question: Do block IDs differ across language variants? + +**Needs research:** How Docusaurus handles i18n + hash link navigation + +--- + +## Open Questions + +### 1. Notion Link Format Investigation + +**Need to verify:** +- What exactly does notion-to-md output for different link types? +- Are same-page hash links handled differently than cross-page links? +- How are `link_to_page` block types represented? +- How are page mentions (`@Page Name`) represented? + +**Action:** Create test Notion page with all link types and run notion-to-md to see actual output + +### 2. Block Content Availability + +**Question:** Do we already have block content during markdown generation? + +**Investigation findings:** +- Yes, blocks are already fetched via `fetchNotionBlocks()` in `generateBlocks.ts` +- No additional API calls needed for block content +- Block tree structure is available for traversal + +### 3. Error Handling Strategy + +**Scenarios:** +- Link references non-existent page +- Block ID doesn't exist or is deleted +- Malformed Notion URL + +**Options:** +- **Strict:** Fail the build with clear error +- **Graceful:** Warn and keep original URL +- **Explicit:** Convert to marked broken link (e.g., `[link ⚠️](#broken)`) + +**Decision needed:** Team preference for error handling approach + +### 4. Caching Strategy + +**Questions:** +- Should block anchor mappings be persisted between builds? +- How to handle incremental sync with link changes? +- How to detect and clean up broken links? 
+ +**Current state:** Page metadata cache exists (`pageMetadataCache.ts`), could be extended + +--- + +## Related Documentation + +### Codebase Files +- `scripts/notion-fetch/contentSanitizer.ts` - Current link sanitization (lines 107-117) +- `scripts/notion-fetch/generateBlocks.ts` - Page processing and slug generation (lines 582-593) +- `scripts/notion-fetch/pageMetadataCache.ts` - Page ID to file path mapping +- `scripts/remark-fix-image-paths.ts` - Example remark plugin architecture +- `docusaurus.config.ts` - Remark plugin configuration (line 283) + +### External Resources +- [Notion Help: Links & Backlinks](https://www.notion.com/help/create-links-and-backlinks) +- [notion-to-md Issue #161](https://github.com/souvikinator/notion-to-md/issues/161) +- [Docusaurus MDX Plugins](https://docusaurus.io/docs/markdown-features/plugins) +- [Docusaurus Hash-Links Issue #11358](https://github.com/facebook/docusaurus/issues/11358) +- [Super.so: Anchor Links Guide](https://help.super.so/en/articles/6388730-how-to-link-to-a-part-of-a-page-anchor-links) + +--- + +## Success Criteria + +### MVP (Minimum Viable Product) +- [ ] Same-page hash links work with readable anchor names +- [ ] Cross-page links convert to local doc paths +- [ ] Cross-page + hash links work correctly +- [ ] External links remain unchanged +- [ ] No regressions in existing link behavior + +### Nice to Have +- [ ] Broken link detection and reporting +- [ ] Link validation during build +- [ ] I18n-aware hash link routing +- [ ] Cached mappings for faster incremental builds +- [ ] Migration tool for existing content + +--- + +## Next Steps + +1. **Phase 1: Investigation** (1-2 hours) + - Create test Notion pages with various link types + - Run notion-to-md to document actual output formats + - Document findings in technical investigation document + +2. 
**Phase 2: Architecture Decision** (Team review) + - Review implementation plan options + - Decide on error handling strategy + - Decide on caching approach + - Choose implementation approach + +3. **Phase 3: Implementation** (6-12 hours, depending on approach) + - Implement chosen solution + - Write comprehensive tests + - Update documentation + +4. **Phase 4: Validation** (2-3 hours) + - Test with real Notion content + - Verify links work in Docusaurus + - Performance testing + - Deploy to preview environment + +--- + +**Document Version:** 1.0 +**Last Updated:** 2025-11-27 +**Authors:** Claude (AI Assistant) +**Reviewers:** _Pending team review_ diff --git a/context/hash-links-summary.md b/context/hash-links-summary.md new file mode 100644 index 0000000..02d5f5d --- /dev/null +++ b/context/hash-links-summary.md @@ -0,0 +1,287 @@ +# Hash Links - Quick Summary + +**Issue:** #93 - Hash links are simply skipped and are no links at all during rendering +**Priority:** High +**Status:** Investigation Complete - Ready for Implementation +**Date:** 2025-11-27 + +--- + +## The Problem in 30 Seconds + +Notion allows linking to specific sections within pages using hash anchors (e.g., `https://notion.so/page#section`). Currently, these links are either broken or stripped during conversion to markdown, resulting in poor documentation navigation. + +**User wants:** +- Links like `/docs/installation-guide#prerequisites` that work +- Both same-page and cross-page hash links +- Readable anchor names (not block IDs) + +--- + +## Key Findings + +### 1. notion-to-md Outputs Malformed Tags ⚠️ + +The library doesn't output standard markdown links. Instead: +```html + +[link](#) +``` + +**Problem:** We're discarding the actual link information! + +### 2. No Cross-Page Link Rewriting + +There's no system to convert: +``` +https://notion.so/page-id → /docs/page-slug +``` + +### 3. 
We Already Use Remark Plugins ✅ + +The project already uses remark plugins for transformations: +```typescript +remarkPlugins: [remarkFixImagePaths] // docusaurus.config.ts:283 +``` + +This is the right architecture for our solution. + +--- + +## Recommended Solution + +**Approach: Hybrid Strategy (Remark Plugin + Enhanced Sanitizer)** + +### Phase 1: Enhanced Sanitizer +Extract link information instead of discarding it: +```typescript +// OLD: → [link](#) +// NEW: → [link](notion://page-id#block-id) +``` + +### Phase 2: Export Mappings +Generate `notion-link-mappings.json`: +```json +{ + "pages": { "notion-page-id": "page-slug" }, + "blocks": { "block-id": "readable-anchor" } +} +``` + +### Phase 3: Remark Plugin +Transform during Docusaurus build: +```typescript +// notion://page-id#block-id → /docs/page-slug#readable-anchor +``` + +**Why this approach?** +- ✅ Follows existing patterns +- ✅ Clean architecture +- ✅ Maintainable +- ✅ 10-14 hours effort + +--- + +## Documentation Structure + +### 📄 Read These Documents + +1. **[hash-links-specification.md](./hash-links-specification.md)** + - Full problem statement + - Requirements and expected behavior + - Success criteria + - **Read this first for context** + +2. **[hash-links-implementation-plan.md](./hash-links-implementation-plan.md)** + - Four different implementation approaches + - Detailed comparison matrix + - Recommended approach (Hybrid Strategy) + - Effort estimates and roadmap + - **Read this for choosing implementation approach** + +3. **[hash-links-technical-investigation.md](./hash-links-technical-investigation.md)** + - Detailed technical findings + - Current pipeline analysis + - Code references and examples + - Unanswered questions + - **Read this for technical deep-dive** + +--- + +## Next Steps for Team + +### 1. Review Documents (1-2 hours) +- [ ] Read specification +- [ ] Review implementation approaches +- [ ] Understand technical findings + +### 2. 
Team Discussion (1 hour) +- [ ] Choose implementation approach +- [ ] Decide on error handling strategy +- [ ] Assign implementation owner +- [ ] Set timeline + +### 3. Investigation Phase (2-3 hours) +Before implementation, verify: +- [ ] Create test Notion page with various link types +- [ ] Document exact notion-to-md output formats +- [ ] Confirm approach is viable + +### 4. Implementation (10-14 hours) +- [ ] Phase 1: Enhanced sanitizer (2-3h) +- [ ] Phase 2: Mapping exporter (2-3h) +- [ ] Phase 3: Remark plugin (4-5h) +- [ ] Testing (2-3h) + +### 5. Deployment & Validation (2-3 hours) +- [ ] Test with real content +- [ ] Deploy to preview +- [ ] Monitor and iterate + +**Total estimated time:** 16-23 hours across all five steps (14-20 hours from team decision to deployment) + +--- + +## Quick Comparison: Implementation Approaches + +| Approach | Effort | Risk | Maintainability | Recommendation | +|----------|--------|------|-----------------|----------------| +| A: Custom Post-Processing | 14-20h | Medium | Medium | ❌ Too complex | +| B: Remark Plugin Only | 10-13h | Low | High | ✅ Good option | +| C: Wait for Upstream | 2-6mo | High | High | ❌ Too slow | +| **D: Hybrid Strategy** | **10-14h** | **Low** | **High** | **✅ Recommended** | + +--- + +## Open Questions for Discussion + +### 1. Error Handling +**Question:** What happens when a link references a non-existent page? + +**Options:** +- A. Fail the build (strict) +- B. Warn and keep original URL (graceful) +- C. Convert to broken link marker (explicit) + +**Vote needed:** Team preference? + +### 2. I18n Behavior +**Question:** Should hash links be language-aware? + +**Example:** Portuguese page links to heading - go to PT or EN version? + +**Research needed:** Docusaurus i18n + hash behavior + +### 3. Caching +**Question:** Commit link mappings to git or regenerate? + +**Options:** +- A. Generate every time (slower, always fresh) +- B.
Commit to git (faster, can be stale) + +**Recommendation:** Generate on Notion fetch + +--- + +## Risk Assessment + +### Low Risk ✅ +- Following existing patterns +- Clear implementation path +- Reversible changes + +### Medium Risk ⚠️ +- notion-to-md output format assumptions +- I18n edge cases +- Performance impact (likely minimal) + +### Mitigation Strategy +- Investigation phase validates assumptions +- Comprehensive testing plan +- Incremental implementation + +--- + +## Success Metrics + +### Functional +- ✅ Same-page hash links work +- ✅ Cross-page links work +- ✅ Cross-page + hash links work +- ✅ No regressions +- ✅ External links unchanged + +### Non-Functional +- ✅ Build time impact < 10% +- ✅ Clear error messages +- ✅ Documentation complete +- ✅ Tests passing + +--- + +## Related Issues & Resources + +### Codebase +- Current sanitizer: `scripts/notion-fetch/contentSanitizer.ts:107-117` +- Slug generation: `scripts/notion-fetch/generateBlocks.ts:582-593` +- Remark plugin example: `scripts/remark-fix-image-paths.ts` + +### External +- [notion-to-md Issue #161](https://github.com/souvikinator/notion-to-md/issues/161) - Upstream issue +- [Docusaurus MDX Plugins](https://docusaurus.io/docs/markdown-features/plugins) - Plugin docs +- [Notion Links Help](https://www.notion.com/help/create-links-and-backlinks) - How Notion links work + +--- + +## Decision Needed + +**Team:** Please review the three detailed documents and come to a meeting ready to discuss: + +1. ✅ or ❌ on recommended Hybrid Strategy approach +2. Decision on error handling (fail vs warn vs mark) +3. Decision on i18n behavior +4. Implementation owner assignment +5. Timeline commitment + +**Estimated meeting time:** 1 hour + +--- + +## Quick Start After Decision + +Once team approves approach: + +```bash +# 1. Create feature branch +git checkout -b feature/hash-links-support + +# 2. Investigation phase +# Create test Notion page, document findings + +# 3. 
Implementation (in order) +touch scripts/notion-fetch/anchorSlugifier.ts +touch scripts/notion-fetch/linkMappingExporter.ts +touch scripts/remark-notion-links.ts + +# 4. Tests +touch scripts/notion-fetch/anchorSlugifier.test.ts +touch scripts/notion-fetch/linkMappingExporter.test.ts +touch scripts/remark-notion-links.test.ts + +# 5. Update existing +# - scripts/notion-fetch/contentSanitizer.ts +# - scripts/notion-fetch/generateBlocks.ts +# - docusaurus.config.ts +``` + +--- + +**Questions?** Review the detailed documents or reach out to the team lead. + +**Ready to start?** Begin with investigation phase after team approval. + +--- + +**Document Version:** 1.0 +**Last Updated:** 2025-11-27 +**Status:** Ready for Team Review diff --git a/context/hash-links-technical-investigation.md b/context/hash-links-technical-investigation.md new file mode 100644 index 0000000..8adb7a1 --- /dev/null +++ b/context/hash-links-technical-investigation.md @@ -0,0 +1,758 @@ +# Hash Links Technical Investigation + +**Issue:** #93 - Hash links are simply skipped and are no links at all during rendering +**Investigation Date:** 2025-11-27 +**Status:** Complete + +--- + +## Executive Summary + +This document provides detailed technical findings from investigating how Notion links are currently processed through the notion-to-md library and the comapeo-docs pipeline. Key discovery: **notion-to-md outputs malformed HTML/JSX tags** instead of standard markdown links, requiring special handling. + +--- + +## Investigation Methodology + +### Tools Used +1. Codebase search (Grep) for link processing patterns +2. Analysis of existing content sanitizer +3. Review of notion-to-md library documentation +4. Web research on Notion API link handling +5. 
Analysis of existing remark plugin architecture + +### Files Analyzed +- `scripts/notion-fetch/contentSanitizer.ts` +- `scripts/notion-fetch/generateBlocks.ts` +- `scripts/notionClient.ts` +- `scripts/remark-fix-image-paths.ts` +- `docusaurus.config.ts` +- `context/quick-ref/block-examples.json` + +--- + +## Finding 1: notion-to-md Output Format + +### Discovery + +The notion-to-md library (v3.1.9) does **NOT** output standard markdown links for Notion page references and mentions. Instead, it outputs malformed HTML/JSX-like tags. + +### Evidence + +From `contentSanitizer.ts` (lines 107-117), we see patterns that are being sanitized: + +```typescript +// Fix malformed patterns +content = content.replace( + //gi, + "[link to section](#section)" +); + +// Fix other malformed tags with invalid attributes +content = content.replace(/]*[^\w\s"=-][^>]*>/g, "[link](#)"); + +// Fix malformed tags with invalid attributes +content = content.replace(/]*[^\w\s"=-][^>]*>/g, "[Link](#)"); +``` + +### Examples of Malformed Tags + +| notion-to-md Output | Current Sanitizer Output | Problem | +|---------------------|--------------------------|---------| +| `` | `[link to section](#section)` | Generic placeholder | +| `` | `[link](#)` | Lost all link info | +| `` | `[Link](#)` | Lost all link info | + +### Test Evidence + +From `contentSanitizer.test.ts` (lines 53-75): + +```typescript +test("should fix malformed link tags", () => { + const input = "Check for details."; + const result = sanitizeMarkdownContent(input); + expect(result).toBe("Check [link to section](#section) for details."); +}); + +test("should fix malformed Link tags with dots", () => { + const input = "Check for details."; + const result = sanitizeMarkdownContent(input); + expect(result).toBe("Check [link to section](#section) for details."); +}); + +test("should fix malformed Link tags with invalid attributes", () => { + const input = "Visit page."; + const result = sanitizeMarkdownContent(input); + 
expect(result).toBe("Visit [link](#) page."); +}); +``` + +### Implications + +1. **Current sanitizer discards link information** - converts everything to `[link](#)` +2. **Need to extract page/block IDs** from these malformed tags +3. **Tag format is unpredictable** - varies by link type + +--- + +## Finding 2: Notion Rich Text Structure + +### Notion API Format + +From `context/quick-ref/block-examples.json`: + +```json +{ + "type": "text", + "text": { + "content": "Example paragraph text", + "link": null + }, + "annotations": { + "bold": false, + "italic": false, + "strikethrough": false, + "underline": false, + "code": false, + "color": "default" + }, + "plain_text": "Example paragraph text", + "href": null +} +``` + +### Link Structure + +When a link is present: + +```json +{ + "type": "text", + "text": { + "content": "link text", + "link": { + "url": "https://example.com" + } + }, + "href": "https://example.com" +} +``` + +### Mention Structure + +From `scripts/notion-fetch/emojiExtraction.test.ts`: + +```json +{ + "type": "mention", + "mention": { + "type": "custom_emoji", + "custom_emoji": { + "url": "https://example.com/emoji1.png", + "name": "smile" + } + }, + "plain_text": ":smile:" +} +``` + +**Note:** Page mentions likely follow similar structure with `"type": "page"` or `"type": "link_to_page"` + +--- + +## Finding 3: Current Link Processing Pipeline + +### Step-by-Step Flow + +``` +┌────────────────────────────────────────────────────────────┐ +│ 1. Notion API - Raw Block Data │ +│ • Rich text arrays with link objects │ +│ • Mention objects for page references │ +│ • Block IDs for all content blocks │ +└────────────────────────────────────────────────────────────┘ + ↓ +┌────────────────────────────────────────────────────────────┐ +│ 2. notionClient.ts - Notion Client Setup │ +│ • Initializes NotionToMarkdown (n2m) │ +│ • Sets custom paragraph transformer │ +│ • Line 259: const n2m = new NotionToMarkdown(...) 
│ +└────────────────────────────────────────────────────────────┘ + ↓ +┌────────────────────────────────────────────────────────────┐ +│ 3. cacheLoaders.ts - Markdown Conversion │ +│ • Line 173: n2m.pageToMarkdown(pageId) │ +│ • Converts Notion blocks to markdown array │ +│ • notion-to-md processes rich_text → malformed tags │ +└────────────────────────────────────────────────────────────┘ + ↓ +┌────────────────────────────────────────────────────────────┐ +│ 4. generateBlocks.ts - Block Processing │ +│ • Line 280: n2m.toMarkdownString(markdown) │ +│ • Converts markdown array to string │ +│ • Returns structure: { parent: "...", child: {...} } │ +└────────────────────────────────────────────────────────────┘ + ↓ +┌────────────────────────────────────────────────────────────┐ +│ 5. contentSanitizer.ts - Fix Malformed Tags │ +│ • Line 107-117: Replace patterns │ +│ • Converts to markdown links │ +│ • PROBLEM: Discards actual link targets │ +└────────────────────────────────────────────────────────────┘ + ↓ +┌────────────────────────────────────────────────────────────┐ +│ 6. contentWriter.ts - Write Markdown Files │ +│ • Combines frontmatter + content │ +│ • Writes to docs/ or i18n/{lang}/docs/ │ +└────────────────────────────────────────────────────────────┘ + ↓ +┌────────────────────────────────────────────────────────────┐ +│ 7. 
Docusaurus Build │ +│ • Processes markdown files │ +│ • Applies remark plugins │ +│ • Line 283 (docusaurus.config.ts): │ +│ remarkPlugins: [remarkFixImagePaths] │ +└────────────────────────────────────────────────────────────┘ +``` + +### Key Code References + +**notionClient.ts - Initialization (line 259):** +```typescript +const n2m = new NotionToMarkdown({ notionClient: notion }); +``` + +**cacheLoaders.ts - Markdown Conversion (line 173):** +```typescript +fetchFn: (pageId) => n2m.pageToMarkdown(pageId) +``` + +**generateBlocks.ts - String Conversion (line 280):** +```typescript +const markdownString = n2m.toMarkdownString(markdown); +``` + +**contentSanitizer.ts - Current Fix Attempt (lines 107-117):** +```typescript +// 3. Fix malformed patterns +content = content.replace( + //gi, + "[link to section](#section)" +); + +// 4. Fix other malformed tags +content = content.replace(/]*[^\w\s"=-][^>]*>/g, "[link](#)"); + +// 5. Fix malformed tags +content = content.replace(/]*[^\w\s"=-][^>]*>/g, "[Link](#)"); +``` + +--- + +## Finding 4: Page URL Generation + +### Current Slug Generation + +From `generateBlocks.ts` (lines 582-593): + +```typescript +const filename = title + .toLowerCase() + .replace(/\s+/g, "-") + .replace(/[^a-z0-9-]/g, ""); +``` + +### Examples + +| Notion Page Title | Generated Slug | Generated URL | +|-------------------|----------------|---------------| +| "Getting Started" | `getting-started` | `/docs/getting-started` | +| "API Documentation" | `api-documentation` | `/docs/api-documentation` | +| "v2.0 Release Notes" | `v20-release-notes` | `/docs/v20-release-notes` | + +### Frontmatter Structure + +From `frontmatterBuilder.ts`: + +```yaml +--- +id: doc-getting-started +title: Getting Started +slug: /getting-started +--- +``` + +### Multi-Language Structure + +**English:** +- Slug: `installation-guide` +- File: `docs/installation-guide.md` +- URL: `/docs/installation-guide` + +**Portuguese:** +- Slug: `installation-guide` (same!) 
+- File: `i18n/pt/docs/installation-guide.md` +- URL: `/pt/docs/installation-guide` + +--- + +## Finding 5: Existing Remark Plugin Pattern + +### Current Implementation + +`scripts/remark-fix-image-paths.ts` (lines 1-30): + +```typescript +export default function remarkFixImagePaths() { + function transformNode(node: any): void { + if (!node || typeof node !== "object") return; + + // Markdown image nodes + if (node.type === "image" && typeof node.url === "string") { + if (node.url.startsWith("images/")) { + node.url = `/${node.url}`; + } + } + + // Raw HTML nodes possibly containing + if (node.type === "html" && typeof node.value === "string") { + node.value = node.value.replace(/src=(["'])images\//g, "src=$1/images/"); + } + + // Recurse into children + if (Array.isArray(node.children)) { + for (const child of node.children) transformNode(child); + } + } + + return (tree: any): void => { + transformNode(tree); + }; +} +``` + +### Configuration + +`docusaurus.config.ts` (line 283): + +```typescript +docs: { + path: "docs", + sidebarPath: "./src/components/sidebars.ts", + remarkPlugins: [remarkFixImagePaths], + // ... +} +``` + +### Pattern Analysis + +1. **Traverses AST nodes** recursively +2. **Checks node type** (image, html, etc.) +3. **Transforms URLs** in-place +4. **Handles nested structures** via recursion +5. 
**Simple and maintainable** + +--- + +## Finding 6: Page Metadata Cache System + +### Current Implementation + +From `pageMetadataCache.ts`: + +```typescript +interface PageMetadata { + lastEdited: string; // ISO timestamp + outputPaths: string[]; // Generated file paths + processedAt: string; // ISO timestamp +} + +interface CacheData { + version: string; + scriptHash: string; + pages: Record<string, PageMetadata>; +} +``` + +### Example Cache Entry + +```json +{ + "version": "2.0.0", + "scriptHash": "abc123...", + "pages": { + "notion-page-id-123": { + "lastEdited": "2025-11-27T10:00:00Z", + "outputPaths": ["docs/getting-started.md"], + "processedAt": "2025-11-27T10:05:00Z" + } + } +} +``` + +### Usage + +- **Incremental sync**: Skip unchanged pages +- **Deleted page detection**: Remove orphaned files +- **Change tracking**: Detect when re-processing needed + +### Extension Opportunity + +Could be extended to include: +- `slug: string` - Generated page slug +- `blockAnchors: Record<string, string>` - Block ID → anchor mappings +- `linkedPages: string[]` - Pages this page links to + +--- + +## Finding 7: Block Fetching System + +### Block Structure + +From `fetchNotionData.ts` (line 285): + +```typescript +async function fetchNotionBlocks(pageId: string) { + const blocks = await enhancedNotion.blocksChildrenList({ + block_id: pageId, + }); + + // Recursively fetch nested blocks + for (const block of blocks) { + if (block.has_children) { + block.children = await fetchNotionBlocks(block.id); + } + } + + return blocks; +} +``` + +### Block Data Available + +Each block includes: +- `id` - Unique block identifier +- `type` - Block type (paragraph, heading_1, heading_2, etc.) 
+- `[type]` - Type-specific properties (e.g., `heading_1.rich_text`) +- `has_children` - Whether block has nested content +- `children` - Nested blocks (if fetched) + +### Heading Block Example + +```json +{ + "id": "block-abc123", + "type": "heading_1", + "heading_1": { + "rich_text": [ + { + "type": "text", + "text": { "content": "About This Guide", "link": null }, + "plain_text": "About This Guide" + } + ], + "is_toggleable": false, + "color": "default" + } +} +``` + +### Key Insight + +**All block content is already available during markdown generation!** +- No additional API calls needed for anchor generation +- Can extract heading text from `rich_text` arrays +- Can build block ID → content mapping during fetch + +--- + +## Finding 8: notion-to-md Limitations + +### Known Issue + +**GitHub Issue #161:** "Add support for 'Link to block' to URL hash conversion" +- **Status:** Open (created 2025-07-21) +- **Author:** hlysine (Henry Lin) +- **URL:** https://github.com/souvikinator/notion-to-md/issues/161 + +### Current Behavior + +From issue description: +> Currently, notion-to-md ignores the hash value entirely. The block ID information is stripped out during conversion, resulting in plain page references without any anchor or scroll target functionality. 
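Because the block ID is stripped during conversion, any block ID → anchor mapping would have to be built from the blocks already fetched in Finding 7. The following is a minimal sketch under stated assumptions, not existing project code: `NotionBlock`, `RichText`, `buildBlockAnchors`, and this `slugify` are hypothetical names, the block shape is trimmed down from the heading block example, and the slug rules only approximate what Docusaurus generates (duplicate headings are not disambiguated).

```typescript
// Hypothetical, trimmed-down view of a fetched Notion block (see Finding 7).
interface RichText {
  plain_text: string;
}

interface NotionBlock {
  id: string;
  type: string;
  children?: NotionBlock[];
  // Type-specific payload, e.g. block.heading_1.rich_text
  [key: string]: unknown;
}

const HEADING_TYPES = new Set(["heading_1", "heading_2", "heading_3"]);

// Assumed approximation of GitHub-style slugs; not Docusaurus's actual code.
function slugify(text: string): string {
  return text
    .toLowerCase()
    .trim()
    .replace(/\s+/g, "-") // spaces to hyphens
    .replace(/[^\w-]+/g, "") // drop other special characters
    .replace(/-{2,}/g, "-") // collapse repeated hyphens
    .replace(/^-+|-+$/g, ""); // trim leading/trailing hyphens
}

// Walk the block tree and record an anchor for every heading block.
function buildBlockAnchors(
  blocks: NotionBlock[],
  anchors: Record<string, string> = {}
): Record<string, string> {
  for (const block of blocks) {
    if (HEADING_TYPES.has(block.type)) {
      const payload = block[block.type] as { rich_text?: RichText[] } | undefined;
      const text = (payload?.rich_text ?? []).map((rt) => rt.plain_text).join("");
      const anchor = slugify(text);
      if (anchor) anchors[block.id] = anchor;
    }
    if (block.children) buildBlockAnchors(block.children, anchors);
  }
  return anchors;
}
```

The resulting record has the same shape as the proposed `blockAnchors` cache extension, so it could be stored per page without further API calls. Whether non-heading blocks need anchors at all is still an open question for the investigation phase.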
+ +### Proposed API (from issue) + +```typescript +const n2m = new NotionConverter(notion) + .withPageReferences({ + urlPropertyNameNotion: 'URL', + transformBlockToUrlHash: block => "my-heading" + }) +``` + +### Version Information + +- **Current version in project:** 3.1.9 +- **Latest version:** 3.1.9 (as of investigation) +- **No fix available yet** + +--- + +## Finding 9: Docusaurus Heading Anchors + +### How Docusaurus Generates Anchors + +From web research ([Docusaurus Issue #9663](https://github.com/facebook/docusaurus/issues/9663)): + +Docusaurus uses GitHub-style slugification: + +```typescript +function slugify(text: string): string { + return text + .toLowerCase() + .trim() + .replace(/\s+/g, '-') // spaces → hyphens + .replace(/[^\w\-]+/g, '') // remove special chars + .replace(/\-\-+/g, '-') // collapse multiple hyphens + .replace(/^-+/, '') // trim leading hyphens + .replace(/-+$/, ''); // trim trailing hyphens +} +``` + +### Examples + +| Heading Text | Generated Anchor | +|--------------|------------------| +| "About This Guide" | `about-this-guide` | +| "v2.0 Release" | `v20-release` | +| "Getting Started!" | `getting-started` | +| "FAQ (Frequently Asked)" | `faq-frequently-asked` | + +### Custom Anchor IDs + +Docusaurus supports custom heading IDs ([Issue #3322](https://github.com/facebook/docusaurus/issues/3322)): + +```markdown +## About This Guide {#custom-id} +``` + +Generates: `
<h2 id="custom-id">About This Guide</h2>
` + +### I18n Considerations + +From [Issue #11358](https://github.com/facebook/docusaurus/issues/11358): +- Hash links cause problems with Google Translate +- Anchors are currently case-sensitive +- No built-in translation of anchor IDs + +--- + +## Finding 10: Cross-Page Linking Gap + +### Critical Missing Component + +**No system exists to convert Notion page URLs to local doc URLs.** + +### What's Missing + +1. **Page ID → Slug Mapping** + - Need: `notion-page-id-123` → `getting-started` + - Current: Only exists in-memory during generation + +2. **URL Conversion** + - Need: `https://notion.so/page-id` → `/docs/getting-started` + - Current: No transformation happens + +3. **Link Rewriting** + - Need: Process markdown links to convert URLs + - Current: Links remain as Notion URLs + +### Evidence + +Searched codebase for: +- ❌ No "link rewriter" module +- ❌ No "URL mapper" system +- ❌ No custom link transformer in `notionClient.ts` +- ❌ No link processing in `contentSanitizer.ts` (only malformed tag fixes) + +--- + +## Unanswered Questions + +### 1. Exact Malformed Tag Format + +**Question:** What exact format does notion-to-md output for different link types? + +**Need to verify:** +- Same-page hash links +- Cross-page links +- Cross-page + hash links +- Page mentions (@Page Name) +- link_to_page blocks + +**Action:** Create test Notion page and run notion-to-md + +--- + +### 2. Block ID Availability + +**Question:** Are block IDs preserved in the malformed tags? + +**Example:** +- Does `` contain hidden block ID? +- Or is block ID completely lost? + +**Action:** Examine raw notion-to-md output before sanitization + +--- + +### 3. Page ID in Links + +**Question:** How does notion-to-md represent cross-page links? + +**Possibilities:** +- `` +- `` +- Lost completely? + +**Action:** Test with cross-page links in Notion + +--- + +### 4. I18n Hash Behavior + +**Question:** How do hash links work across language versions? 
+ +**Test scenarios:** +- Portuguese page → Portuguese heading (same page) +- Portuguese page → English heading (cross-language) +- Should anchors be translated? + +**Action:** Research Docusaurus i18n documentation + +--- + +## Recommended Next Steps + +### 1. Immediate: Investigation Phase + +Create test Notion page with: +- [ ] Same-page hash link to heading +- [ ] Cross-page link (no hash) +- [ ] Cross-page link with hash to heading +- [ ] Page mention (@Page Name) +- [ ] link_to_page block +- [ ] External link (control) + +Run through pipeline and document: +- [ ] Raw Notion API response +- [ ] notion-to-md output (before sanitization) +- [ ] After sanitization +- [ ] Final markdown output + +### 2. Document Findings + +Create investigation report with: +- [ ] Exact tag formats discovered +- [ ] Block ID preservation (yes/no) +- [ ] Page ID availability +- [ ] Edge cases found + +### 3. Update Implementation Plan + +Based on investigation findings: +- [ ] Confirm or revise chosen approach +- [ ] Update effort estimates +- [ ] Identify additional requirements +- [ ] Create detailed technical spec + +### 4. Prototype + +Build minimal prototype: +- [ ] Enhanced sanitizer (extract IDs) +- [ ] Simple mapping system +- [ ] Basic link rewriting +- [ ] Validate approach works + +### 5. Full Implementation + +Only after prototype validated: +- [ ] Implement full solution +- [ ] Comprehensive testing +- [ ] Documentation +- [ ] Deployment + +--- + +## Technical Recommendations + +### 1. Use Remark Plugin Architecture + +**Rationale:** +- ✅ Follows existing pattern (`remark-fix-image-paths`) +- ✅ Standard Docusaurus approach +- ✅ Team already familiar with this +- ✅ More maintainable long-term + +### 2. 
Enhance Content Sanitizer + +**Changes needed:** +```typescript +// OLD: Discard link information +content.replace(//gi, "[link](#)"); + +// NEW: Extract and preserve +content.replace(/]+)>/gi, (match, linkInfo) => { + const { pageId, blockId, text } = extractLinkInfo(linkInfo); + return `[${text}](notion://${pageId}${blockId ? '#' + blockId : ''})`; +}); +``` + +### 3. Export Link Mappings + +**New module:** `linkMappingExporter.ts` + +```typescript +interface LinkMappings { + version: string; + generated: string; + pages: Record<string, string>; // pageId → slug + blocks: Record<string, string>; // blockId → anchor +} +``` + +**Output:** `scripts/notion-link-mappings.json` + +### 4. Create Remark Plugin + +**New module:** `remark-notion-links.ts` + +- Load mappings from JSON +- Transform `notion://` URLs +- Generate proper local paths +- Validate and warn on broken links + +--- + +## References + +### Codebase Files +- `scripts/notion-fetch/contentSanitizer.ts` - Current sanitization +- `scripts/notion-fetch/generateBlocks.ts` - Page processing +- `scripts/notionClient.ts` - notion-to-md initialization +- `scripts/remark-fix-image-paths.ts` - Existing remark plugin +- `scripts/notion-fetch/pageMetadataCache.ts` - Cache system + +### External Resources +- [notion-to-md Issue #161](https://github.com/souvikinator/notion-to-md/issues/161) +- [Docusaurus MDX Plugins](https://docusaurus.io/docs/markdown-features/plugins) +- [Docusaurus Hash Links Issue #11358](https://github.com/facebook/docusaurus/issues/11358) +- [Docusaurus Heading IDs Issue #3322](https://github.com/facebook/docusaurus/issues/3322) +- [Notion Help: Links & Backlinks](https://www.notion.com/help/create-links-and-backlinks) + +--- + +**Document Version:** 1.0 +**Investigation Complete:** 2025-11-27 +**Investigator:** Claude (AI Assistant) +**Status:** Ready for team review