Use Google Docs as a Knowledge Base for AI Agents

Extend the capabilities of Cursor Composer and other agentic AI tools by providing them with the knowledge they need to do useful work for you.

This utility automatically synchronizes Google Docs and Sheets from multiple Google Drive folders to local files and generates rich metadata to be used as a knowledge base for LLMs:

- Google Docs are converted to Markdown files with rich metadata
- Google Sheets are converted to CSV files with rich metadata
- Uses async processing for efficient metadata generation via LLM
- Produces a manifest containing metadata for each file
- Supports multiple Google Drive folders with separate local destinations

Setup

Clone this repository:

git clone https://github.com/chrisboden/knowledgesync.git
cd knowledgesync

Install dependencies:
```
pip install -r requirements.txt
```
Set up Google Cloud Project and APIs:
1. Go to Google Cloud Console
2. Create a new project (or select an existing one)
3. Enable the required APIs:
  - Go to "APIs & Services" > "Library"
  - Enable "Google Drive API"
  - Enable "Google Sheets API"
  - Click "Enable" for each
4. Configure OAuth consent screen:
  - Go directly to OAuth consent screen setup
  - Select "External" user type (unless you have Google Workspace)
  - Fill in the required fields:
    - App name: Your app name (e.g., "Knowledge Sync")
    - User support email: Your email
    - Developer contact email: Your email
  - Skip adding scopes (they'll be added automatically)
  - Add your email as a test user
  - You can skip optional fields - they're not needed for testing
5. Create OAuth 2.0 credentials:
  - Go to "APIs & Services" > "Credentials"
  - Click "Create Credentials" > "OAuth client ID"
  - Choose "Desktop application" as the application type
  - Name your client (e.g., "Knowledge Sync Desktop")
  - Download the credentials JSON file
  - Save it as credentials.json in the project root
Note: If you get stuck in a loop on the branding page, use the direct link to the OAuth consent screen: https://console.cloud.google.com/apis/credentials/consent

Create a .env file with:

# Google Drive folder IDs (from the folder URLs)
# Example URL: https://drive.google.com/drive/folders/1234567890abcdef
# Specify as a JSON object mapping folder names to IDs:
DRIVE_FOLDERS={"knowledge": "folder_id_1", "projects": "folder_id_2", "team": "folder_id_3"}

# Local destination folder for synced files
# Use absolute path for best results
DESTINATION_FOLDER=/path/to/local/folder

# OpenRouter Configuration (for AI metadata extraction)
OPENROUTER_BASE_URL=https://openrouter.ai/api/v1
OPENROUTER_API_KEY=your_openrouter_api_key

Each folder in DRIVE_FOLDERS will be synced to its own subdirectory under DESTINATION_FOLDER. For example:

- knowledge folder syncs to DESTINATION_FOLDER/knowledge/
- projects folder syncs to DESTINATION_FOLDER/projects/
- team folder syncs to DESTINATION_FOLDER/team/
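
As a rough illustration of this mapping, the sketch below parses DRIVE_FOLDERS as JSON and derives one subdirectory per entry. The function name `resolve_destinations` is hypothetical and not taken from main.py; only the DRIVE_FOLDERS and DESTINATION_FOLDER variable names come from the configuration above.

```python
import json
from pathlib import Path


def resolve_destinations(env):
    """Map each Drive folder name to (folder_id, local_path).

    `env` is any mapping holding DRIVE_FOLDERS and DESTINATION_FOLDER,
    for example os.environ after the .env file has been loaded.
    """
    folders = json.loads(env["DRIVE_FOLDERS"])  # {"name": "folder_id", ...}
    root = Path(env["DESTINATION_FOLDER"])
    return {name: (folder_id, root / name) for name, folder_id in folders.items()}


example_env = {
    "DRIVE_FOLDERS": '{"knowledge": "folder_id_1", "projects": "folder_id_2"}',
    "DESTINATION_FOLDER": "/path/to/local/folder",
}
for name, (folder_id, path) in resolve_destinations(example_env).items():
    print(f"{name}: {folder_id} -> {path}")
```

Because the value is plain JSON, a malformed DRIVE_FOLDERS entry would raise `json.JSONDecodeError` here, which is a convenient place to catch configuration mistakes early.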

Usage

One-time sync (Google Drive to Local):

python main.py

Two-way sync (Google Drive to Local and Local to Google Drive):

python main.py --two-way

Push-only sync (Local to Google Drive):

python push_to_drive.py

Clean up sync state (after manual file deletions):

python clean_sync_state.py

This utility helps maintain a clean sync state by:

- Validating all entries in the sync state against both local files and Google Drive
- Removing entries for files that no longer exist on either side
- Creating a backup of the sync state file before making changes
- Displaying detailed information about what was removed
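
The clean-up steps listed above can be sketched roughly as follows. This is not the code in clean_sync_state.py: the state file format (a JSON object mapping local paths to Drive file IDs), the `.bak` suffix, and the `clean_sync_state` function are assumptions made for illustration, and "no longer exist on either side" is read here as "missing locally or missing in Drive".

```python
import json
import shutil
from pathlib import Path


def clean_sync_state(state_path, remote_ids):
    """Drop stale entries from a sync state file and return what was removed.

    Assumed (hypothetical) format: {"relative/local/path": "drive_file_id", ...}.
    `remote_ids` is the set of file IDs that still exist in Google Drive.
    """
    state_path = Path(state_path)
    state = json.loads(state_path.read_text())

    # Back up the state file before changing anything.
    backup_path = state_path.with_name(state_path.name + ".bak")
    shutil.copy2(state_path, backup_path)

    kept, removed = {}, {}
    for local_path, file_id in state.items():
        exists_locally = (state_path.parent / local_path).exists()
        exists_remotely = file_id in remote_ids
        # Keep an entry only if the file is still present on both sides.
        target = kept if exists_locally and exists_remotely else removed
        target[local_path] = file_id

    state_path.write_text(json.dumps(kept, indent=2))
    return removed
```

Returning the removed entries lets the caller print a report of what was dropped, in line with the last point in the list.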

Upgrading to Two-Way Sync

If you're upgrading from a previous version with one-way sync only:

1. Back up your synced folders first
2. Clean your sync state: python clean_sync_state.py
3. Run a pull operation: python main.py (to ensure local files are updated)
4. Then run a push operation: python push_to_drive.py (to establish file relationships)

Update your services if using automated syncing:

cp com.gdocs-sync.service.plist ~/Library/LaunchAgents/
cp com.gdocs-push.service.plist ~/Library/LaunchAgents/
launchctl unload ~/Library/LaunchAgents/com.gdocs-sync.service.plist
launchctl load ~/Library/LaunchAgents/com.gdocs-sync.service.plist
launchctl unload ~/Library/LaunchAgents/com.gdocs-push.service.plist
launchctl load ~/Library/LaunchAgents/com.gdocs-push.service.plist

Remember that local changes will now be pushed to Google Drive, so be mindful of what you edit locally. Run clean_sync_state.py after manual file deletions.

Dedicated push tool:

python push_to_drive.py

Automated hourly sync (macOS):

Make the sync script executable:
```
chmod +x sync_docs.sh
```

Edit the sync service plist to set your workspace path:

# Replace the hard-coded workspace path in the plist with the current directory
sed -i '' "s|/Users/chrisboden/Desktop/trycursor|$(pwd)|g" com.gdocs-sync.service.plist

Copy the launchd plist to your user's LaunchAgents:

cp com.gdocs-sync.service.plist ~/Library/LaunchAgents/

Load the service:

launchctl load ~/Library/LaunchAgents/com.gdocs-sync.service.plist

Automated hourly push (macOS):

Make the push script executable:
```
chmod +x push_to_drive.sh
```

Edit the push service plist to set your workspace path:

# Replace the hard-coded workspace path in the plist with the current directory
sed -i '' "s|/Users/chrisboden/Dropbox/tools/knowledgesync|$(pwd)|g" com.gdocs-push.service.plist

Copy the launchd plist to your user's LaunchAgents:

cp com.gdocs-push.service.plist ~/Library/LaunchAgents/

Load the service:

launchctl load ~/Library/LaunchAgents/com.gdocs-push.service.plist

Check that services are running:

launchctl list | grep -E 'gdocs-sync|gdocs-push'
# Should show two entries with the services listed

If services are not running or show exit codes, unload and reload:

launchctl unload ~/Library/LaunchAgents/com.gdocs-sync.service.plist
launchctl load ~/Library/LaunchAgents/com.gdocs-sync.service.plist
launchctl unload ~/Library/LaunchAgents/com.gdocs-push.service.plist
launchctl load ~/Library/LaunchAgents/com.gdocs-push.service.plist

Features

- Support for multiple Google Drive folders:
  - Each folder syncs to its own subdirectory
  - Independent metadata tracking per folder
  - Parallel processing of folders
- Automatic sync of Google Workspace files:
  - Google Docs to Markdown
  - Google Sheets to CSV (one CSV per worksheet)
- Two-way sync support:
  - Push local changes back to Google Drive
  - Create new files in Google Drive from local files
  - Track file relationships between local and remote
- Spreadsheet ID verification:
  - Verifies spreadsheet IDs match expected titles before updating
  - Provides clear warnings for mismatches
  - Displays all file IDs during sync operations for transparency
- Sync state maintenance utilities:
  - Clean up sync state after manual file deletions
  - Prevent errors from missing files or incorrect IDs
- Delta updates (only syncs changed files)
- Rich metadata generation:
  - AI-powered metadata extraction for both documents and spreadsheets
  - Only generates metadata for new or modified files
  - Parallel processing using async API calls
  - Maintains metadata manifests for quick lookups
- Clean image handling (references without base64 data)
- Colored progress output
- Error handling and logging
- Secure OAuth2 authentication
- Automated scheduling via launchd (macOS)
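
To give a feel for the "parallel processing using async API calls" item, here is a minimal sketch of bounded-concurrency metadata generation with asyncio. It is an illustration only: `extract_metadata` is a stand-in for the real LLM request, and the `max_concurrent` limit and the returned fields are assumptions, not the behaviour of the modules in utils/.

```python
import asyncio


async def extract_metadata(name, content):
    """Placeholder for an LLM request; replace with a real API call."""
    await asyncio.sleep(0)  # stands in for network latency
    return {"file": name, "characters": len(content)}


async def generate_all(files, max_concurrent=5):
    """Run metadata extraction for many files with bounded concurrency."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def worker(name, content):
        async with semaphore:  # cap the number of in-flight requests
            return await extract_metadata(name, content)

    tasks = [worker(name, content) for name, content in files.items()]
    return await asyncio.gather(*tasks)  # results keep the input order


files = {"plan.md": "# Plan", "notes.md": "Some notes"}
results = asyncio.run(generate_all(files))
print(results)
```

The semaphore is one common way to stay under an API provider's rate limits while still overlapping requests.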

Directory Structure

.
├── .env                                # Environment configuration (created from .env.example)
├── .env.example                        # Template for environment variables
├── .gitignore                         # Git ignore rules
├── README.md                          # Project documentation
├── credentials.json                    # Your Google OAuth credentials (created from credentials_example.json)
├── main.py                           # Main sync script
├── token.pickle                       # Google API auth token (auto-generated on first run)
├── temp/                             # Temporary files for knowledge base queries
│   └── knowledge_base.md             # Generated knowledge base (temporary)
├── tools/                            # CLI tools
│   ├── README.md                     # Tools documentation
│   └── use_knowledge.py              # Knowledge base query tool
├── utils/                            # Utility modules
│   ├── __init__.py                  # Module exports
│   ├── google_drive_ops.py          # Google Drive operations
│   ├── google_sheets_ops.py         # Google Sheets operations
│   ├── document_metadata_ops.py     # Document metadata operations
│   └── spreadsheet_metadata_ops.py  # Spreadsheet metadata operations
├── prompts/                          # AI prompt templates
│   ├── extract_document_metadata.md  # Document metadata extraction prompt
│   └── extract_spreadsheet_metadata.md  # Spreadsheet metadata extraction prompt
├── requirements.txt                  # Python dependencies
├── sync_docs.sh                      # Sync automation script
├── com.gdocs-sync.service.plist      # macOS launchd agent config
└── destination_folder/               # Your configured sync destination (name matches DESTINATION_FOLDER)
    ├── knowledge/                    # First Drive folder sync
    │   ├── documents/               # Synchronized markdown documents
    │   │   ├── @manifest.json      # Document metadata manifest (auto-generated)
    │   │   └── *.md               # Converted Google Docs
    │   └── spreadsheets/          # Synchronized CSV files
    │       ├── @manifest.json     # Spreadsheet metadata manifest (auto-generated)
    │       └── */                 # One directory per spreadsheet
    │           └── *.csv         # One CSV per worksheet
    ├── projects/                   # Second Drive folder sync (same structure)
    │   ├── documents/
    │   └── spreadsheets/
    └── team/                      # Third Drive folder sync (same structure)
        ├── documents/
        └── spreadsheets/
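
Given this layout, every manifest can be collected with a single glob. The helper below is a sketch under that assumption; `load_manifests` is a hypothetical name, and the manifest contents are treated as opaque JSON because their exact schema is not specified here.

```python
import json
from pathlib import Path


def load_manifests(destination):
    """Return {(folder, kind): parsed manifest} for every @manifest.json found.

    Follows the layout above: <destination>/<folder>/<kind>/@manifest.json,
    where kind is "documents" or "spreadsheets".
    """
    manifests = {}
    for manifest_path in sorted(Path(destination).glob("*/*/@manifest.json")):
        kind = manifest_path.parent.name
        folder = manifest_path.parent.parent.name
        manifests[(folder, kind)] = json.loads(manifest_path.read_text())
    return manifests
```

Calling `load_manifests` with the path configured as DESTINATION_FOLDER would return one entry per synced folder and file type.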

Metadata Management

The utility maintains two manifest files per folder:

- {folder}/documents/@manifest.json - Document metadata including:
  - Document summaries
  - Topics and categories
  - Document sections
  - Last update timestamps
  - Processing status
- {folder}/spreadsheets/@manifest.json - Spreadsheet metadata including:
  - Spreadsheet purpose and content
  - Data types and relationships
  - Worksheet descriptions
  - Last update timestamps
  - Processing status

Metadata is only generated when:

- A new file is added
- An existing file is modified
- A file's metadata is missing or incomplete
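
The three conditions above amount to a small decision function. The sketch below assumes a manifest shaped like `{file_id: {"modified_time": ..., "summary": ..., "topics": ...}}`; those field names, and the `needs_metadata` helper itself, are hypothetical and may not match the real manifest.

```python
def needs_metadata(manifest, file_id, modified_time, required=("summary", "topics")):
    """Decide whether metadata should be (re)generated for one file.

    Assumed (hypothetical) manifest shape:
    {file_id: {"modified_time": "...", "summary": "...", "topics": [...]}}
    """
    entry = manifest.get(file_id)
    if entry is None:
        return True  # new file
    if entry.get("modified_time") != modified_time:
        return True  # file changed since its metadata was generated
    # Metadata is missing or incomplete if any required field is empty.
    return any(not entry.get(field) for field in required)
```

Skipping files for which this returns False is what keeps repeated syncs from re-sending unchanged content to the LLM.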

Using Your Knowledge Base with AI Agents

The synchronized documents and their metadata can be used by AI agents to perform knowledge work. The repository includes a tool called use_knowledge.py that demonstrates this capability.

How the Knowledge Tool Works

- Intelligent Document Selection:
  - Takes a natural language query about your knowledge base
  - Uses LLM to analyze document and spreadsheet manifests
  - Selects the most relevant files based on metadata (summaries, topics, sections)
  - Supports searching across all synchronized folders
- Knowledge Base Creation:
  - Combines selected documents and spreadsheets into a single markdown file
  - Converts spreadsheets to markdown tables for easy reading
  - Includes metadata about why each document was selected
  - Creates a temporary file that can be used by AI agents
- Agent Integration:
  - Perfect for use with AI agents (like Cursor's Composer)
  - Agents can read the combined knowledge base
  - Makes informed responses based on your organization's knowledge
  - Maintains source attribution for all information
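
The spreadsheet-to-table conversion mentioned above can be approximated with the standard csv module. This is a sketch of the general technique, not the code in tools/use_knowledge.py; the function name `csv_to_markdown` is an assumption.

```python
import csv
import io


def csv_to_markdown(csv_text):
    """Render CSV text as a Markdown table, treating the first row as the header."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ""
    header, *body = rows

    def fmt(cells):
        # Escape pipes so cell content cannot break the table layout.
        return "| " + " | ".join(c.replace("|", "\\|") for c in cells) + " |"

    lines = [fmt(header), "| " + " | ".join("---" for _ in header) + " |"]
    lines.extend(fmt(row) for row in body)
    return "\n".join(lines)


print(csv_to_markdown("Name,Role\nJohn Doe,Developer\nJane Smith,Designer\n"))
```

Markdown tables tend to be easier for an LLM to read than raw CSV because the header row and column boundaries are explicit.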

Example Usage

Command line:

python tools/use_knowledge.py "what are our current AI initiatives?"

The tool will:

1. Search through all document and spreadsheet manifests
2. Select relevant files based on the query
3. Create a combined markdown file with:
  - Document content
  - Converted spreadsheet data
  - Selection rationales
  - Source attribution
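
A combined file of this kind could be assembled along the following lines. The dictionary keys (`source`, `rationale`, `content`), the heading layout, and the `build_knowledge_base` function are illustrative assumptions; the actual output of use_knowledge.py may be structured differently.

```python
def build_knowledge_base(query, selections):
    """Combine selected files into one Markdown document.

    `selections` is a list of dicts with the hypothetical keys
    "source" (file path), "rationale" (why it was selected) and "content".
    """
    parts = [f"# Knowledge base for: {query}", ""]
    for item in selections:
        parts += [
            f"## {item['source']}",  # source attribution
            "",
            f"Selected because: {item['rationale']}",
            "",
            item["content"].strip(),
            "",
        ]
    return "\n".join(parts)


selections = [
    {
        "source": "knowledge/documents/ai_strategy.md",
        "rationale": "Summary mentions current AI initiatives",
        "content": "We are piloting two AI projects.\n",
    }
]
print(build_knowledge_base("what are our current AI initiatives?", selections))
```

Keeping the source path as a heading above each excerpt is one simple way to let an agent cite where a statement came from.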

This temporary knowledge base can then be used by AI agents to:

- Answer questions about your organization
- Analyze project data
- Generate reports
- Make recommendations based on your actual documents

Benefits

- Context-Aware: Agents work with your actual organizational knowledge
- Source Attribution: All information can be traced back to source documents
- Cross-Document Analysis: Combines information from multiple sources
- Format Agnostic: Works with both documents and spreadsheets
- Efficient: Uses metadata for quick document selection
- Privacy-Focused: Works with your local files, no cloud storage needed

Monitoring

Check sync status: launchctl list | grep gdocs-sync
Check push status: launchctl list | grep gdocs-push

View logs:

tail -f sync.log     # For sync output
tail -f sync.error.log  # For sync error messages
tail -f push.log     # For push output
tail -f push.error.log  # For push error messages

Troubleshooting

- If authentication fails:
  - Delete token.pickle and try again
  - Ensure credentials.json is in the project root
  - Check you're added as a test user in OAuth consent screen
- If sync fails:
  - Check the Google Drive folder ID is correct
  - Ensure you have read access to the source folder
  - Review error messages in sync.error.log
- If spreadsheet sync has issues:
  - Run python clean_sync_state.py to ensure the sync state is clean
  - Check if spreadsheet IDs or names have changed
  - Look for warning messages about spreadsheet ID verification
- If files were manually deleted:
  - Run python clean_sync_state.py to update the sync state
  - This prevents errors when trying to update non-existent files
- If scheduled sync isn't running:
  - Check service status: launchctl list | grep gdocs-sync
  - Ensure paths in com.gdocs-sync.service.plist are correct
  - Review system logs: log show --predicate 'subsystem == "com.gdocs-sync.service"'
- If metadata extraction fails:
  - Check OpenRouter API key and base URL in .env
  - Verify the file content is accessible
  - Look for token limit warnings in the logs

Spreadsheet Formatting

When working with spreadsheets, follow these guidelines to ensure proper synchronization:

- Use Standard CSV Formatting:
  - Avoid using comment lines (lines starting with #) in CSV files
  - Ensure CSV files have proper headers as the first row
  - Maintain consistent column structure throughout the file
- Spreadsheet Structure:
  - Each spreadsheet is a directory containing one or more CSV files
  - Each CSV file represents a worksheet in Google Sheets
  - The directory name becomes the spreadsheet name in Google Drive

Example of properly formatted CSV:

Name,Email,Role,Department
John Doe,john.doe@example.com,Developer,Engineering
Jane Smith,jane.smith@example.com,Designer,Product

Improper formatting (avoid):

# This is a comment line that may cause sync issues
Name,Email,Role,Department
John Doe,john.doe@example.com,Developer,Engineering
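
Before pushing a local CSV, these guidelines can be checked mechanically. The checker below is a sketch and not part of the repository: `check_csv` is a hypothetical helper that flags comment lines and rows whose column count differs from the header.

```python
import csv


def check_csv(csv_text):
    """Return a list of formatting problems found in CSV text (empty if none)."""
    problems = []
    data_lines = []
    for number, line in enumerate(csv_text.splitlines(), start=1):
        if line.lstrip().startswith("#"):
            problems.append(f"line {number}: comment line")
        else:
            data_lines.append(line)

    # Parse only the non-comment lines and ignore blank rows.
    rows = [row for row in csv.reader(data_lines) if row]
    if not rows:
        return problems + ["file is empty"]

    width = len(rows[0])  # the header row defines the column count
    for number, row in enumerate(rows[1:], start=2):
        if len(row) != width:
            problems.append(f"row {number}: expected {width} columns, found {len(row)}")
    return problems


sample = "# exported yesterday\nName,Role\nJohn Doe,Developer\nJane Smith\n"
for problem in check_csv(sample):
    print(problem)
```

Row numbers in the messages count data rows after comment lines are removed, so they may differ from the line numbers in the original file.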

Testing

Several test scripts are available in the tests directory to verify functionality:

- test_new_files.py - Tests that new local files are properly created in Google Drive
- test_edit_files.py - Tests that edits to local files are properly synced to Google Drive
- test_sync.py - Tests basic two-way sync functionality
- test_spreadsheet_sync.py - Tests spreadsheet sync functionality
- test_push_service.py - Tests the configuration of the automated push service
- test_edit_files_standard.py - Tests editing and syncing existing files with standard CSV formatting

Run each test with:

python3 tests/test_script_name.py

For more detailed testing instructions, see TESTING.md.
