A lightweight, open framework to deliver high-quality structured data to LLMs
LangShake is a modern web protocol that makes your site machine-readable, AI-friendly, and verifiable, without changing your frontend.
It introduces a three-part system to expose structured content to Large Language Models (LLMs), search agents, and automated crawlers:
LangShake adds new fields to your existing sitemap.xml, allowing you to link to structured content and verify its integrity.
<url>
  <loc>https://example.com/article</loc>
  <lastmod>2025-04-16T17:58:00Z</lastmod>
  <langshake:schema-url>https://example.com/langshake/article.json</langshake:schema-url>
  <langshake:checksum>8f7a9b3cf5a...a8b9c07e8f</langshake:checksum>
</url>
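As a rough illustration of how a crawler might read these fields, here is a minimal Python sketch using the standard library. The namespace URI bound to the `langshake` prefix is not given above, so `https://example.com/ns/langshake` is a placeholder, and `read_langshake_entries` is a hypothetical helper name, not part of any LangShake tooling.

```python
import xml.etree.ElementTree as ET

# Assumption: the real namespace URI for the "langshake" prefix is not stated
# in the text; this placeholder is for illustration only.
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "langshake": "https://example.com/ns/langshake",
}

SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:langshake="https://example.com/ns/langshake">
  <url>
    <loc>https://example.com/article</loc>
    <lastmod>2025-04-16T17:58:00Z</lastmod>
    <langshake:schema-url>https://example.com/langshake/article.json</langshake:schema-url>
    <langshake:checksum>8f7a9b3cf5a...a8b9c07e8f</langshake:checksum>
  </url>
</urlset>
"""


def read_langshake_entries(sitemap_xml: str) -> list[dict]:
    """Return one dict per <url> entry that carries a LangShake schema URL."""
    root = ET.fromstring(sitemap_xml)
    entries = []
    for url in root.findall("sm:url", NS):
        schema_url = url.findtext("langshake:schema-url", namespaces=NS)
        if schema_url is None:
            continue  # ordinary sitemap entry without LangShake fields
        entries.append(
            {
                "loc": url.findtext("sm:loc", namespaces=NS),
                "schema_url": schema_url,
                "checksum": url.findtext("langshake:checksum", namespaces=NS),
            }
        )
    return entries


for entry in read_langshake_entries(SITEMAP):
    print(entry["loc"], "->", entry["schema_url"])
```

Entries without the extra fields are skipped, so the same sitemap remains usable by crawlers that ignore the extension.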
The second part is a standalone JSON file following Schema.org standards. LLMs read it instead of parsing HTML, and its checksum lets them detect whether it has been modified.
{
  "@context": "http://schema.org",
  "@type": "Article",
  "headline": "LangShake: Revolutionizing LLM Training Data",
  "description": "A guide to implementing the LangShake protocol...",
  "articleBody": "...",
  "author": { "name": "Jane Smith" },
  "publisher": { "name": "Example Corp" },
  "checksum": "8f7a9b3cf5a...a8b9c07e8f"
}
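The text does not say how the checksum is computed, so the following is only a sketch under stated assumptions: SHA-256 over the JSON with the `checksum` field removed, serialized with sorted keys and compact separators. `compute_checksum` and `verify_module` are hypothetical names chosen for illustration.

```python
import hashlib
import json


def compute_checksum(module: dict) -> str:
    """Hash a module with its own "checksum" field left out.

    Assumptions (not confirmed by the text): SHA-256, and a canonical form
    with sorted keys, compact separators, and UTF-8 encoding.
    """
    payload = {k: v for k, v in module.items() if k != "checksum"}
    canonical = json.dumps(
        payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify_module(module: dict, sitemap_checksum: str) -> bool:
    """Check that the embedded and the sitemap checksum both match the content."""
    expected = compute_checksum(module)
    return module.get("checksum") == expected and sitemap_checksum == expected


article = {
    "@context": "http://schema.org",
    "@type": "Article",
    "headline": "LangShake: Revolutionizing LLM Training Data",
}
article["checksum"] = compute_checksum(article)

print(verify_module(article, article["checksum"]))   # content untouched
tampered = dict(article, headline="Edited headline")
print(verify_module(tampered, article["checksum"]))  # content changed
```

Excluding the `checksum` field from the hashed payload avoids the circular problem of a file having to contain its own hash.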
The third part is a file that serves as a high-level entry point for AI agents. It describes your site, links to structured content, and optionally includes context and a Merkle root for integrity checks.
{
  "version": "1.0",
  "site": { "name": "Example Corp", "language": "en" },
  "modules": [ "/langshake/article.json" ],
  "llm_context": {
    "summary": "We build open-source AI tools.",
    "principles": ["Transparency", "Ethics"]
  },
  "verification": { "strategy": "merkle", "merkleRoot": "abc123..." }
}
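To make the `merkleRoot` field more concrete, here is one possible way to fold per-module checksums into a single root. The exact tree construction is not described above; this sketch assumes SHA-256, hex-encoded leaf checksums in the order of the `modules` list, and duplication of the last node on odd-sized levels. `merkle_root` is a hypothetical helper.

```python
import hashlib


def merkle_root(leaf_hashes: list[str]) -> str:
    """Combine hex-encoded leaf hashes pairwise until one root remains.

    Assumptions (not confirmed by the text): SHA-256 for inner nodes, and the
    last node of an odd-sized level is paired with itself.
    """
    if not leaf_hashes:
        raise ValueError("at least one leaf hash is required")
    level = [bytes.fromhex(h) for h in leaf_hashes]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate the last node on odd levels
        level = [
            hashlib.sha256(level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()


# Stand-in leaf checksums for three modules (illustrative input only).
leaves = [hashlib.sha256(name.encode()).hexdigest() for name in ("a", "b", "c")]
print(merkle_root(leaves))
```

With a root like this, an agent can detect a change to any listed module by recomputing the tree from the module checksums and comparing a single value.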
Minimizes HTML parsing with structured JSON metadata, reducing computational overhead and speeding up data collection processes.
Checksum verification lets consumers confirm that content hasn't been modified since it was published, which makes for a more reliable training dataset.
Builds on established Sitemap.xml and Schema.org standards, making adoption straightforward for websites already using these technologies.
CLI tool abstracts complexity; caching speeds up local builds, making implementation straightforward for development teams.
Developers can add nuance for LLMs in a dedicated field, enabling more precise and contextually appropriate AI interactions.
Modular, extendable spec allows for Merkle, LLM summaries, and more, ensuring the protocol can adapt to emerging technologies.
We're inviting experts in AI, web standards, and data engineering to provide feedback and contribute to LangShake's development. Your expertise can help shape the future of LLM training data collection.