Reference implementation for chunking nested JSON into RAG-friendly document structures

Mocksi/json-rag

JSON RAG Integration

A tool for efficiently loading and integrating nested JSON data structures into RAG (Retrieval-Augmented Generation) systems, with enhanced entity tracking, relationship detection, and context preservation.

Key Features

  • Advanced Query Understanding:

    • Temporal patterns (exact dates, relative ranges, named periods)
    • Metric aggregations (average, maximum, minimum, sum, count)
    • Entity relationships (direct, semantic, and cross-file connections)
    • State transitions and system conditions
    • Hybrid search combining vector similarity, relationships, and filters
  • Smart Data Processing:

    • Automatic entity detection and relationship mapping
    • Cross-file relationship detection and validation
    • Key-value pair extraction for filtered searches
    • Embedded metadata tracking
    • Batch processing with change detection
  • Archetype-Aware Processing:

    • Pattern detection (entities, events, metrics, collections)
    • Archetype-based scoring and ranking
    • Relationship validation by archetype
    • Context-aware embedding generation
    • Archetype-specific traversal strategies
  • Hierarchical Data Management:

    • Full JSON structure preservation
    • Parent-child relationship tracking
    • Cross-file relationship mapping
    • Contextual embedding with ancestry
    • Path-based chunk identification
  • Enhanced Retrieval:

    • Vector similarity search using PGVector
    • Relationship-aware context assembly
    • Entity-aware result filtering
    • Cross-file context expansion
    • Confidence-based scoring and ranking
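As a rough illustration of path-based chunk identification, parent-child tracking, and embedding with ancestry, the sketch below walks a nested JSON document and emits one chunk per object. The `Chunk` class and the `iter_chunks` and `embedding_text` functions are hypothetical names chosen for this example; they are not taken from the project's modules, and the project's actual chunking rules may differ.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict, Iterator, Optional


@dataclass
class Chunk:
    path: str                   # JSONPath-like identifier, e.g. "$.items[0]"
    parent_path: Optional[str]  # path of the nearest enclosing object, None at the root
    fields: Dict[str, Any]      # leaf key-value pairs, usable for filtered searches


def is_leaf(value: Any) -> bool:
    """Scalars and lists of scalars stay inside the chunk that owns them."""
    if isinstance(value, dict):
        return False
    if isinstance(value, list):
        return not any(isinstance(item, (dict, list)) for item in value)
    return True


def iter_chunks(node: Any, path: str = "$",
                parent: Optional[str] = None) -> Iterator[Chunk]:
    """Yield one chunk per JSON object, depth first, in document order."""
    if isinstance(node, dict):
        yield Chunk(path, parent, {k: v for k, v in node.items() if is_leaf(v)})
        for key, child in node.items():
            yield from iter_chunks(child, f"{path}.{key}", path)
    elif isinstance(node, list):
        # Lists are not chunks themselves; their elements keep the same parent.
        # Scalars inside a mixed list are skipped in this sketch.
        for index, child in enumerate(node):
            yield from iter_chunks(child, f"{path}[{index}]", parent)


def embedding_text(chunk: Chunk, by_path: Dict[str, Chunk]) -> str:
    """Join the chunk with its ancestors so the embedded text carries context."""
    parts = []
    current: Optional[Chunk] = chunk
    while current is not None:
        parts.append(f"{current.path}: {json.dumps(current.fields, sort_keys=True)}")
        current = by_path.get(current.parent_path)
    return " <- ".join(parts)


doc = {
    "id": "o-1",
    "customer": {"name": "Ada"},
    "items": [{"sku": "A1", "qty": 2}, {"sku": "B2", "qty": 1}],
}
chunks = list(iter_chunks(doc))
by_path = {c.path: c for c in chunks}
for c in chunks:
    print(embedding_text(c, by_path))
```

Keeping the parent path on every chunk means the ancestry can be rebuilt at embedding time without storing the full document alongside each chunk.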

Quick Start

  1. Clone and install:
git clone https://github.com/Mocksi/json-rag.git
cd json-rag
python -m venv rag_env
source rag_env/bin/activate  # Windows: rag_env\Scripts\activate
pip install -r requirements.txt
  2. Configure database:
# Update POSTGRES_CONN_STR in app/core/config.py:
POSTGRES_CONN_STR = "dbname=myragdb user=your_user host=localhost port=5432"
  3. Set up environment:
# Create .env file with:
OPENAI_API_KEY=your-key-here
  4. Initialize and run:
python -m app.main --new  # Truncates all tables and starts fresh
python -m app.main        # Normal operation
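Before the first run, it can help to confirm that the placeholders from the steps above have been replaced. The following is a minimal sketch using only the standard library; `parse_conn_str` and `check_setup` are hypothetical helpers written for this example, not functions shipped with the project, and the parser assumes the simple space-separated `key=value` form shown above (quoted values are not handled).

```python
import os
from typing import Dict, List, Mapping


def parse_conn_str(conn_str: str) -> Dict[str, str]:
    """Split a 'dbname=x user=y' style string into a dict of parameters."""
    params: Dict[str, str] = {}
    for token in conn_str.split():
        key, sep, value = token.partition("=")
        if not sep or not key or not value:
            raise ValueError(f"malformed connection parameter: {token!r}")
        params[key] = value
    return params


def check_setup(conn_str: str,
                env: Mapping[str, str] = os.environ) -> List[str]:
    """Return a list of problems; an empty list means the basics are in place."""
    problems: List[str] = []
    params = parse_conn_str(conn_str)
    for key in ("dbname", "user"):
        if key not in params:
            problems.append(f"connection string is missing '{key}'")
    if params.get("user") == "your_user":
        problems.append("connection string still uses the placeholder user")
    if env.get("OPENAI_API_KEY", "your-key-here") == "your-key-here":
        problems.append("OPENAI_API_KEY is unset or still the placeholder")
    return problems


if __name__ == "__main__":
    example = "dbname=myragdb user=your_user host=localhost port=5432"
    for problem in check_setup(example):
        print("setup problem:", problem)
```

Note that the check reads the process environment, so a key stored only in `.env` must already have been loaded into the environment for it to be seen.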

Architecture

app/
├── analysis/           # Analysis and pattern detection
│   ├── archetype.py   # Pattern and archetype detection
│   └── relationships.py # Cross-file relationship analysis
├── core/              # Core system components
│   ├── config.py      # Configuration settings
│   └── models.py      # Data models
├── processing/        # Data processing modules
│   ├── json_parser.py # JSON structure parsing
│   ├── parsing.py     # Document parsing and chunking
│   └── processor.py   # Data processing pipeline
├── retrieval/         # Query processing and retrieval
│   ├── embedding.py   # Vector embedding generation
│   └── retrieval.py   # Query pipeline and execution
├── storage/           # Data persistence
│   └── database.py    # PostgreSQL and vector storage
├── utils/             # Utility modules
│   └── logging_config.py # Logging configuration
├── __init__.py        # Package initialization
├── chat.py           # Chat interface and interactions
└── main.py           # Application entry point

The codebase is organized into logical modules:

  • analysis/: Modules for analyzing data patterns, cross-file relationships, and user intent
  • core/: Core system configuration and shared components
  • processing/: Data processing and relationship detection modules
  • retrieval/: Relationship-aware search and context assembly
  • storage/: Database interaction and relationship persistence
  • utils/: Shared utility functions and helpers

Each module is designed to be independent with clear responsibilities, while working together through well-defined interfaces.

Installation Requirements

  • Python 3.8 or higher
  • PostgreSQL 12 or higher with PGVector extension
  • OpenAI API key
  • Required Python packages (see requirements.txt)

Documentation

The codebase features comprehensive inline documentation:

  • Detailed module-level docstrings explaining key concepts
  • Function and class documentation with examples
  • Type hints and parameter descriptions
  • Usage examples and implementation notes

Contributing

We welcome contributions! Please see our Contributing Guide for details on:

  • Setting up your development environment
  • Code style guidelines
  • Pull request process
  • Development workflow

Code of Conduct

This project follows the Contributor Covenant Code of Conduct. By participating, you are expected to uphold this code. Please report unacceptable behavior.

License

MIT License - see LICENSE file for details.

Roadmap

  • Cross-file relationship detection
  • Archetype-aware retrieval
  • Relationship-based context expansion
  • Confidence scoring algorithm refinement
  • State transition handling improvements
  • Batch processing optimization
  • Metric aggregation capabilities
  • Entity filtering rules improvement
  • Context assembly performance optimization
  • Advanced archetype pattern detection

Query Pipeline

The system implements a structured reasoning pipeline:

  1. Query Analysis:

    • Determines required data types
    • Identifies needed operations (filtering, aggregation)
    • Detects relationships and constraints
  2. Plan Creation:

    • Builds retrieval strategy
    • Plans processing operations
    • Determines result formatting
  3. Execution:

    • Retrieves relevant chunks
    • Processes according to plan
    • Assembles coherent response

Because every query passes through the same three stages, queries are handled consistently, and context and relationships are preserved along the way.
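The three stages can be sketched in a few lines. This is an illustrative toy, not the project's retrieval code: the `Analysis` and `Plan` classes, the `analyze`, `make_plan`, and `execute` functions, and the tiny query syntax (`<aggregation> <metric> where key=value`) are all assumptions made for the example, and chunks are modeled as flat dictionaries held in memory rather than rows in PostgreSQL.

```python
import re
from dataclasses import dataclass
from statistics import mean
from typing import Any, Callable, Dict, List, Optional

# The aggregation names mirror those listed under Key Features.
AGGREGATIONS: Dict[str, Callable] = {
    "average": mean, "maximum": max, "minimum": min, "sum": sum, "count": len,
}


@dataclass
class Analysis:
    aggregation: Optional[str]
    metric: Optional[str]
    filters: Dict[str, str]


@dataclass
class Plan:
    filters: Dict[str, str]
    metric: Optional[str]
    aggregation: Optional[str]


def analyze(query: str) -> Analysis:
    """Stage 1: detect the operation, the metric it applies to, and constraints."""
    text = query.lower()
    tokens = text.split()
    aggregation = next((t for t in tokens if t in AGGREGATIONS), None)
    metric = None
    if aggregation is not None:
        position = tokens.index(aggregation)
        if position + 1 < len(tokens):
            metric = tokens[position + 1]
    filters = dict(re.findall(r"(\w+)=([\w-]+)", text))
    return Analysis(aggregation, metric, filters)


def make_plan(analysis: Analysis) -> Plan:
    """Stage 2: validate the analysis and fix the retrieval and processing steps."""
    if analysis.aggregation is not None and analysis.metric is None:
        raise ValueError("an aggregation needs a metric to operate on")
    return Plan(analysis.filters, analysis.metric, analysis.aggregation)


def execute(plan: Plan, chunks: List[Dict[str, Any]]) -> Any:
    """Stage 3: retrieve matching chunks, then process them according to the plan."""
    matching = [
        c for c in chunks
        if all(str(c.get(k, "")).lower() == v for k, v in plan.filters.items())
    ]
    if plan.aggregation is None:
        return matching
    values = [c[plan.metric] for c in matching if plan.metric in c]
    if not values:
        return 0 if plan.aggregation == "count" else None
    return AGGREGATIONS[plan.aggregation](values)


chunks = [
    {"service": "api", "latency_ms": 120},
    {"service": "api", "latency_ms": 80},
    {"service": "web", "latency_ms": 300},
]
plan = make_plan(analyze("average latency_ms where service=api"))
print(execute(plan, chunks))
```

Keeping the plan as plain data between analysis and execution makes each stage easy to test in isolation.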
