code-tokenizer-md
Created to push creative limits.
Process git repository files into markdown with token counting and sensitive data redaction.
Overview
`code-tokenizer-md` is a TypeScript/Bun tool that processes git repository files, cleans code, redacts sensitive information, and generates markdown documentation with token counts.
graph TD
Start[Start] -->|Read| Git[Git Files]
Git -->|Clean| TC[TokenCleaner]
TC -->|Redact| Clean[Clean Code]
Clean -->|Generate| MD[Markdown]
MD -->|Count| Results[Token Counts]
style Start fill:#000000,stroke:#FFFFFF,stroke-width:4px,color:#ffffff
style Git fill:#222222,stroke:#FFFFFF,stroke-width:2px,color:#ffffff
style TC fill:#333333,stroke:#FFFFFF,stroke-width:2px,color:#ffffff
style Clean fill:#444444,stroke:#FFFFFF,stroke-width:2px,color:#ffffff
style MD fill:#555555,stroke:#FFFFFF,stroke-width:2px,color:#ffffff
style Results fill:#666666,stroke:#FFFFFF,stroke-width:2px,color:#ffffff
Features
Data Processing
- Reads files from git repository
- Removes comments and unnecessary whitespace
- Redacts sensitive information (API keys, tokens, etc.)
- Counts tokens using llama3-tokenizer
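To make the cleaning step concrete, here is a minimal sketch of comment and whitespace removal. The `cleanCode` helper and its regular expressions are assumptions made for illustration; the actual rules in `TokenCleaner.ts` may differ.

```typescript
// Hypothetical helper, not the library's API: strips comments and
// unnecessary whitespace from a source string using simple regexes.
function cleanCode(source: string): string {
  return source
    .replace(/\/\*[\s\S]*?\*\//g, '') // drop /* block */ comments
    .replace(/^[ \t]*\/\/.*$/gm, '') // drop whole-line // comments
    .split('\n')
    .map((line) => line.replace(/[ \t]+$/, '')) // strip trailing whitespace
    .filter((line) => line.trim() !== '') // drop blank lines
    .join('\n');
}

const sampleSource = [
  '/* header */',
  '// setup',
  'const a = 1;   ',
  '',
  'const b = 2;',
].join('\n');

const cleanedSource = cleanCode(sampleSource);
console.log(cleanedSource);
```

Regex-based cleaning is a heuristic: a `/* ... */` sequence inside a string literal would also be removed, so a production cleaner may need to be more careful.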
Analysis Types
- Token counting per file
- Total token usage
- File content analysis
- Sensitive data detection
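Per-file and total token counts can be sketched as follows. The `countTokens` helper and the `Tokenizer` type are hypothetical, and the word-splitting tokenizer is only a stand-in so the example runs without dependencies; the tool itself counts with llama3-tokenizer, which will produce different numbers.

```typescript
// A tokenizer is modelled as a function from text to a list of token ids.
type Tokenizer = (text: string) => number[];

interface SourceFile {
  path: string;
  content: string;
}

interface FileTokenCount {
  path: string;
  tokens: number;
}

// Stand-in tokenizer (assumption): one "token" per whitespace-separated word.
const wordTokenizer: Tokenizer = (text) =>
  text.split(/\s+/).filter(Boolean).map((_word, index) => index);

// Hypothetical helper: counts tokens per file and sums them into a total.
function countTokens(files: SourceFile[], encode: Tokenizer) {
  const perFile: FileTokenCount[] = files.map((file) => ({
    path: file.path,
    tokens: encode(file.content).length,
  }));
  const total = perFile.reduce((sum, entry) => sum + entry.tokens, 0);
  return { perFile, total };
}

const tokenReport = countTokens(
  [
    { path: 'src/a.ts', content: 'const a = 1;' },
    { path: 'src/b.ts', content: 'export {};' },
  ],
  wordTokenizer,
);
console.log(tokenReport);
```

Passing the tokenizer in as a parameter keeps the counting logic independent of any particular tokenizer package.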
Data Presentation
- Markdown formatted output
- Code block formatting
- Token count summaries
- File organization hierarchy
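The presentation step might look roughly like the sketch below: one section per file with its token count and a fenced code block, followed by a total. The `renderMarkdown` helper, the heading level, and the summary wording are assumptions; the layout produced by `MarkdownGenerator` may differ.

```typescript
interface ProcessedFile {
  path: string;
  content: string;
  tokens: number;
}

// Build the code-fence marker from single backticks so this example
// can itself sit inside a fenced block.
const FENCE = ['`', '`', '`'].join('');

// Hypothetical helper: renders processed files as a markdown document.
function renderMarkdown(files: ProcessedFile[]): string {
  const sections = files.map((file) =>
    [
      `## ${file.path}`,
      `Tokens: ${file.tokens}`,
      FENCE,
      file.content,
      FENCE,
    ].join('\n'),
  );
  const total = files.reduce((sum, file) => sum + file.tokens, 0);
  return sections.concat([`Total tokens: ${total}`]).join('\n\n');
}

const renderedDoc = renderMarkdown([
  { path: 'src/a.ts', content: 'const a = 1;', tokens: 4 },
  { path: 'src/b.ts', content: 'export {};', tokens: 2 },
]);
console.log(renderedDoc);
```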
Requirements
- Node.js (>=14.0.0)
- Bun runtime
- Git repository
Installation
Usage
CLI
npx code-tokenizer-md
Library
npm install code-tokenizer-md
Programmatic Usage
import { MarkdownGenerator } from 'code-tokenizer-md';
const generator = new MarkdownGenerator({
  dir: './project',
  outputFilePath: './output.md',
});
const result = await generator.createMarkdownDocument();
Project Structure
src/
├── index.ts # Main exports
├── TokenCleaner.ts # Code cleaning and redaction
├── MarkdownGenerator.ts # Markdown generation logic
├── cli.ts # CLI implementation
├── fileExclusions.ts # File exclusion patterns
└── fileTypeExclusions.ts # File type exclusions
Dependencies
{
  "dependencies": {
    "llama3-tokenizer-js": "^1.0.0",
    "micromatch": "^4.0.8"
  },
  "peerDependencies": {
    "node": ">=14.0.0"
  },
  "devDependencies": {
    "@eslint/js": "^9.14.0",
    "eslint": "^9.14.0",
    "globals": "^15.12.0",
    "prettier": "^3.3.3",
    "bun": "latest",
    "@types/bun": "latest",
    "@types/node": "^22.9.1",
    "@types/micromatch": "^4.0.9"
  }
}
Development
This project uses Bun for its toolchain. As a consumer of the library, you should be able to use whatever tooling you prefer.
Building
npm run build
Testing
npm test
Linting and Formatting
# Lint
npm run lint
# Fix linting issues
npm run lint:fix
# Format code
npm run format
# Fix all (format + lint)
npm run fix
Extending
Adding Custom Patterns
const generator = new MarkdownGenerator({
  customPatterns: [{ regex: /TODO:/g, replacement: '' }],
  customSecretPatterns: [{ regex: /mySecret/g, replacement: '[REDACTED]' }],
});
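As a rough illustration of what such patterns do, the sketch below applies the two rule lists from the example above to a sample string. The `PatternRule` type and `applyRules` helper are hypothetical, and running cleaning rules before secret rules is an assumption; the library may apply them differently internally.

```typescript
// Shape assumed from the example above: a regex plus its replacement text.
interface PatternRule {
  regex: RegExp;
  replacement: string;
}

// Hypothetical helper: applies each rule in order to the text.
function applyRules(code: string, rules: PatternRule[]): string {
  return rules.reduce(
    (text, rule) => text.replace(rule.regex, rule.replacement),
    code,
  );
}

const customPatterns: PatternRule[] = [{ regex: /TODO:/g, replacement: '' }];
const customSecretPatterns: PatternRule[] = [
  { regex: /mySecret/g, replacement: '[REDACTED]' },
];

const rawInput = 'const key = "mySecret"; // TODO: rotate';

// Cleaning rules first, then secret redaction (ordering is an assumption).
const redactedOutput = applyRules(
  applyRules(rawInput, customPatterns),
  customSecretPatterns,
);
console.log(redactedOutput);
```

The `g` flag on each regex matters: without it, `String.prototype.replace` only replaces the first match in the file.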
Configuration Options
interface MarkdownGeneratorOptions {
  dir?: string; // Project directory
  outputFilePath?: string; // Output markdown file path
  fileTypeExclusions?: Set<string>; // File types to exclude
  fileExclusions?: string[]; // File patterns to exclude
  customPatterns?: Record<string, any>; // Custom cleaning patterns
  customSecretPatterns?: Record<string, any>; // Custom redaction patterns
  verbose?: boolean; // Enable verbose logging
}
Contributing
- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Open a Pull Request
Contribution Guidelines
- Write TypeScript code following the project's style
- Include appropriate error handling
- Add documentation for new features
- Include tests for new functionality
- Update the README for significant changes
License
MIT © 2024 Geoff Seemueller
Note
This tool requires a git repository to function properly.