Key Facts
- The tool uses Claude Code to query a public read-only SQL and vector database.
- It covers Hacker News, arXiv, LessWrong, and other public commons sites.
- The data embedded so far includes 1.4M posts and 15.6M comments, embedded with Voyage-3.5-lite.
- Features include an Alerts system for email notifications on specific criteria.
- Compositional vector search allows filtering by sentiment and topic simultaneously.
Quick Summary
A developer has introduced a research tool that uses Claude Code to query a large, public, read-only SQL and vector database. The system aggregates data from public commons sites, including Hacker News, arXiv, and LessWrong. It is designed to answer nuanced questions by having Claude generate complex SQL queries that run safely on the developer's machine.
Key features include an automated alert system and advanced compositional vector search capabilities. Currently, the database hosts 1.4 million posts and 15.6 million comments embedded with Voyage-3.5-lite. While the developer aims to expand coverage, financial limitations currently prevent embedding all available sources.
Core Functionality and Architecture
To use the research tool, users paste a prompt into Claude Code. The prompt itself contains an embedded API key, which grants access to a public read-only database containing both SQL and vector data. The tool's purpose is to let Claude carry out research across a wide array of public data sources.
Instead of running queries directly on external platforms, Claude generates "monster SQL queries" that are executed safely on the developer's local machine. This approach allows for the processing of complex, nuanced questions that standard search engines might struggle to answer. The system effectively acts as an intermediary, translating user intent into executable database commands.
The database currently aggregates data from dozens of high-quality public commons sites. Only part of that data is embedded so far:
- 1.4 million posts embedded, out of an implied total of 4.6 million
- 15.6 million comments embedded, out of an implied total of 38 million
These embeddings are generated using the Voyage-3.5-lite model.
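Taking the figures above at face value, and assuming the "implied total" numbers really are the full post and comment counts, the share of the data embedded so far can be worked out with a few lines of arithmetic. The `coverage` helper is hypothetical.

```python
def coverage(embedded_millions: float, total_millions: float) -> float:
    """Fraction of items that have been embedded."""
    return embedded_millions / total_millions


# Figures from the article; the totals are implied, not confirmed.
post_coverage = coverage(1.4, 4.6)
comment_coverage = coverage(15.6, 38.0)

print(f"posts: {post_coverage:.1%}, comments: {comment_coverage:.1%}")
# prints "posts: 30.4%, comments: 41.1%"
```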
Advanced Search and Alerts 📢
Beyond simple querying, the tool offers sophisticated search capabilities and an automated alert system. The Alerts functionality is particularly useful for monitoring specific, hard-to-track topics. Users can ask Claude to submit a SQL query as an alert, which triggers an email notification whenever the ultra-nuanced criteria are met and the output changes.
For example, a user could set an alert to be notified when someone posts about "estrogen" in a psychoactive context, or when enough biology metaphors are used in discussions about building infrastructure. This allows for precise monitoring of niche topics across the public commons.
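The alert behavior described above (notify only when the criteria are met and the output changes) can be sketched as a fingerprint comparison between runs. This is a minimal sketch under stated assumptions, not the tool's implementation: the `comments` table, the `check_alert` and `result_fingerprint` helpers, and the simple `LIKE` filter are all hypothetical, and the email step is left out.

```python
import hashlib
import json
import sqlite3


def result_fingerprint(rows) -> str:
    """Stable hash of a query result, used to detect changes between runs."""
    payload = json.dumps(rows, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def check_alert(conn, sql, last_fingerprint):
    """Return (fired, new_fingerprint, rows) for one alert check."""
    rows = conn.execute(sql).fetchall()
    fingerprint = result_fingerprint(rows)
    # Fire only when there are matches AND the output differs from last time.
    fired = bool(rows) and fingerprint != last_fingerprint
    return fired, fingerprint, rows


# Stand-in data (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute("INSERT INTO comments (body) VALUES ('A post about bridges')")

ALERT_SQL = "SELECT id, body FROM comments WHERE body LIKE '%estrogen%' ORDER BY id"

fired_1, fp, _ = check_alert(conn, ALERT_SQL, None)  # no matches yet
conn.execute("INSERT INTO comments (body) VALUES ('estrogen and mood')")
fired_2, fp, _ = check_alert(conn, ALERT_SQL, fp)    # new match appears
fired_3, fp, _ = check_alert(conn, ALERT_SQL, fp)    # output unchanged
```

In a real deployment the check would run on a schedule and send an email wherever `fired` is true. A real alert query would also likely use vector similarity, not a keyword match.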
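The system also supports compositional vector search, a technique that combines embedding vectors arithmetically to allow highly specific filtering. An example provided demonstrates how to search for writing about the "FTX crisis" that is distinctly free of guilty tones, yet may still mention the word "guilt." This is achieved through a query structure resembling: @FTX_crisis - (@guilt_tone - @guilt_topic). The parenthesized difference isolates guilty tone from guilt as a subject, so subtracting it removes the tone without penalizing mentions of the topic.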
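To make the vector arithmetic concrete, here is a toy sketch in pure Python. It assumes the `@name` terms stand for embedding vectors and that results are ranked by cosine similarity; the article confirms neither detail. The three-dimensional vectors, their axis meanings, and the document names are invented for illustration, whereas real embeddings have many more dimensions and no labeled axes.

```python
import math


def subtract(a, b):
    """Element-wise a - b."""
    return [x - y for x, y in zip(a, b)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy axes (hypothetical): [about FTX, mentions guilt, guilty tone]
FTX_CRISIS = [1.0, 0.0, 0.0]
GUILT_TOPIC = [0.0, 1.0, 0.0]
GUILT_TONE = [0.0, 1.0, 1.0]  # guilty-toned writing also tends to mention guilt

# @FTX_crisis - (@guilt_tone - @guilt_topic)
query = subtract(FTX_CRISIS, subtract(GUILT_TONE, GUILT_TOPIC))

docs = {
    "ftx_plain": [1.0, 0.0, 0.0],
    "ftx_mentions_guilt": [1.0, 1.0, 0.0],
    "ftx_guilty_tone": [1.0, 1.0, 1.0],
    "unrelated_guilty_tone": [0.0, 1.0, 1.0],
}

# Rank documents from most to least similar to the composed query.
ranking = sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)
```

In this toy setup the composed query has no component on the "mentions guilt" axis, so an FTX document that mentions guilt neutrally still ranks above one written in a guilty tone. Subtracting `GUILT_TONE` directly would instead push both down.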
Scope and Limitations
The project aims to embed "everything and all the other sources" to create a comprehensive research environment. However, the developer notes a significant limitation regarding resources. While the technical capability exists to embed additional sources cheaply, the developer states they "literally don't have the money" to expand the dataset further at this time.
Despite these financial constraints, the current implementation covers a vast landscape of information. By focusing on sites like Hacker News, arXiv, and LessWrong, the tool targets communities known for high-quality technical and intellectual discourse. The ability to query these specific datasets via natural language prompts represents a significant step forward in accessible data analysis.
Conclusion
The introduction of this Claude Code-powered research tool demonstrates the potential for large language models to interact with massive, specialized datasets. By combining SQL generation, vector search, and automated alerting, the system provides a robust framework for deep research into public commons data. While currently limited by funding, the existing prototype offers a glimpse into the future of automated, nuanced information retrieval.




