Geoff Seemueller d1d72dae28 Enhance file exclusion logic and update version to 1.0.5
Refactored file exclusion patterns in MarkdownGenerator.js to handle various glob patterns and types more comprehensively. Added normalization for paths, extended support for different file types, and included specific directory-only and wildcard matching. Updated the package version to 1.0.5 in package.json to reflect these changes.
2024-11-21 12:40:45 -05:00
2024-11-07 11:20:52 -05:00
2024-11-07 11:37:56 -05:00
2024-11-07 11:37:56 -05:00

code-tokenizer-md

Created to push creative limits.

Process git repository files into markdown with token counting and sensitive data redaction.

Overview

code-tokenizer-md is a Node.js tool that processes git repository files, cleans code, redacts sensitive information, and generates markdown documentation with token counts.

graph TD
   Start[Start] -->|Read| Git[Git Files]
   Git -->|Clean| TC[TokenCleaner]
   TC -->|Redact| Clean[Clean Code]
   Clean -->|Generate| MD[Markdown]
   MD -->|Count| Results[Token Counts]
   style Start fill:#000000,stroke:#FFFFFF,stroke-width:4px,color:#ffffff
   style Git fill:#222222,stroke:#FFFFFF,stroke-width:2px,color:#ffffff
   style TC fill:#333333,stroke:#FFFFFF,stroke-width:2px,color:#ffffff
   style Clean fill:#444444,stroke:#FFFFFF,stroke-width:2px,color:#ffffff
   style MD fill:#555555,stroke:#FFFFFF,stroke-width:2px,color:#ffffff
   style Results fill:#666666,stroke:#FFFFFF,stroke-width:2px,color:#ffffff

Features

Data Processing

  • Reads files from git repository
  • Removes comments and unnecessary whitespace
  • Redacts sensitive information (API keys, tokens, etc.)
  • Counts tokens using llama3-tokenizer

Analysis Types

  • Token counting per file
  • Total token usage
  • File content analysis
  • Sensitive data detection

Data Presentation

  • Markdown formatted output
  • Code block formatting
  • Token count summaries
  • File organization hierarchy

Requirements

  • Node.js (>=14.0.0)
  • Git repository
  • npm or npx

Installation

npm install -g code-tokenizer-md

Usage

Quick Start

npx code-tokenizer-md

Programmatic Usage

import { MarkdownGenerator } from 'code-tokenizer-md';

const generator = new MarkdownGenerator({
  dir: './project',
  outputFilePath: './output.md',
});

const result = await generator.createMarkdownDocument();

Project Structure

src/
├── index.js              # Main exports
├── TokenCleaner.js       # Code cleaning and redaction
├── MarkdownGenerator.js  # Markdown generation logic
└── cli.js               # CLI implementation

Dependencies

{
  "dependencies": {
    "llama3-tokenizer-js": "^1.0.0"
  },
  "peerDependencies": {
    "node": ">=14.0.0"
  }
}

Extending

Adding Custom Patterns

const generator = new MarkdownGenerator({
  customPatterns: [{ regex: /TODO:/g, replacement: '' }],
  customSecretPatterns: [{ regex: /mySecret/g, replacement: '[REDACTED]' }],
});

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Commit your changes
  4. Push to the branch
  5. Open a Pull Request

Contribution Guidelines

  • Follow Node.js best practices
  • Include appropriate error handling
  • Add documentation for new features
  • Include tests for new functionality (this project needs a suite)
  • Update the README for significant changes

License

MIT © 2024 Geoff Seemueller

Note

This tool requires a git repository to function properly.

Description
it's a tool, it's a library, it's regular expressions!
Readme AGPL-3.0 952 KiB
Languages
TypeScript 90.5%
JavaScript 8%
CSS 1.5%