content-judge
Evaluates multiple documents and agent-generated content on correctness, depth, clarity, and relevance, and selects the highest-scoring one. This makes it useful for research, content validation, and knowledge refinement.
Example requests
beeai run content-judge '{ "text": "Generate a concise summary of the history of artificial intelligence.", "agents": [ "gpt-researcher", "ollama-deep-researcher" ] }'
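Pre-provided documents can be passed in the same way (the document strings below are purely illustrative):
beeai run content-judge '{ "text": "Generate a concise summary of the history of artificial intelligence.", "documents": [ "AI research began as an academic discipline in the 1950s...", "The term artificial intelligence was coined at the 1956 Dartmouth workshop..." ] }'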
Description
The agent evaluates multiple documents and agent-generated content against four key criteria: correctness, depth & coverage, clarity & structure, and relevance. It assigns each document a numerical score (0–1) for every criterion, then combines the scores into a weighted average to determine the highest-scoring document. This ensures that the most accurate, comprehensive, well-structured, and relevant document is selected.
How It Works
The agent accepts two types of input:
- Pre-provided documents – Static documents submitted by the user or other agents.
- Agent-generated content – Content dynamically retrieved from specified agents in the system.
The agent processes all provided text inputs and evaluates them based on the defined criteria. It then selects the document with the highest weighted score and returns it as the best choice.
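A minimal sketch of this flow in Python (illustrative only: fetch_from_agent and score_document are hypothetical stand-ins for the platform's agent calls and for the LLM-based scoring described under Evaluation Criteria):

def fetch_from_agent(agent: str, prompt: str) -> str:
    # Hypothetical: stands in for querying another agent in the system.
    return f"[content generated by {agent} for: {prompt}]"

def score_document(document: str, prompt: str) -> float:
    # Hypothetical: stands in for the LLM-based weighted scoring
    # described under Evaluation Criteria below.
    return 0.0

def judge(text: str, documents: list[str] | None = None,
          agents: list[str] | None = None) -> str:
    # Candidates are the pre-provided documents plus the content
    # generated by each listed agent.
    candidates = list(documents or [])
    candidates += [fetch_from_agent(a, text) for a in agents or []]
    # Return the candidate with the highest weighted score.
    return max(candidates, key=lambda d: score_document(d, text))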
Input Parameters
The agent accepts the following input parameters:
- text (string) – The research prompt or query guiding document selection.
- documents (array of strings, optional) – A list of pre-provided documents for evaluation.
- agents (array of strings, optional) – A list of agents to query for additional content.
If no documents are provided, the agent relies entirely on agent-generated content.
Evaluation Criteria
- Correctness (50%) – Assesses factual accuracy, penalizing misinformation.
- Depth & Coverage (10%) – Measures how well the document explores key aspects of the topic.
- Clarity & Structure (10%) – Evaluates logical organization and readability.
- Relevance (30%) – Determines how well the document aligns with the given research prompt.
The agent uses the Llama 3.1 8B model to perform the structured evaluations and produce the per-criterion scores.
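Concretely, a document's overall score is a weighted average of its four criterion scores, each in the 0–1 range. A minimal sketch of the arithmetic (the per-criterion scores themselves come from the model's evaluation):

WEIGHTS = {"correctness": 0.5, "depth": 0.1, "clarity": 0.1, "relevance": 0.3}

def weighted_score(criterion_scores: dict[str, float]) -> float:
    # Weighted average of the four criterion scores; the weights sum
    # to 1.0, so the result also falls in the 0-1 range.
    return sum(WEIGHTS[c] * s for c, s in criterion_scores.items())

# Example: a factually strong but shallow document.
print(weighted_score({"correctness": 0.9, "depth": 0.4,
                      "clarity": 0.8, "relevance": 0.7}))  # 0.78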
Use Cases
- Research Validation – Ensures high-quality, well-researched content by selecting the most reliable sources.
- Content Refinement – Helps refine AI-generated content by scoring and selecting the most coherent and accurate version.
- Document Summarization Assessment – Evaluates multiple AI-generated summaries and chooses the most comprehensive one.
- Quality Assurance for AI Outputs – Ensures AI responses in a pipeline meet accuracy and relevance requirements.