fix(builder): handle both dict and string formats for core_entities in knowledge unit extractor #717

unidel2035 · 2025-11-01T16:13:11Z

🎯 Summary

This PR fixes the error AttributeError: 'dict' object has no attribute 'split', which is raised during knowledge extraction, as reported in issue #714.

🐛 Problem

The knowledge unit extractor failed whenever the LLM returned the core_entities field as a dictionary, because the code expected only a string and called .split(",") on the value.

Root Cause

The LLM can return core_entities in two different formats depending on the language and prompt:

- String format (Chinese example): "核心实体": "火电发电量,同比增长率,2019年" (that is, "core entities": "thermal power generation, year-on-year growth rate, 2019")
- Dict format (English example): "Core Entities": {"T.I.": "Person", "No Mediocre": "Culture and Entertainment"}

The code at kag/builder/component/extractor/knowledge_unit_extractor.py:587 only handled the string format:

# Raises AttributeError when core_entities is a dict
for item in knowledge_value.get("core_entities", "").split(","):
    ...

Error Stack Trace

AttributeError: 'dict' object has no attribute 'split'
  File "/kag/builder/component/extractor/knowledge_unit_extractor.py", line 587, in assemble_knowledge_unit
    for item in knowledge_value.get("core_entities", "").split(","):
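
The failure can be reproduced outside the extractor in a few lines. The following is a minimal sketch: split_entities is a hypothetical stand-in for the original string-only loop, and the two payloads are illustrative values modeled on the formats described above, not captured LLM output.

```python
# Illustrative payloads modeled on the two formats (not real LLM output).
string_payload = {"core_entities": "entity1,entity2,entity3"}
dict_payload = {
    "core_entities": {"T.I.": "Person", "No Mediocre": "Culture and Entertainment"}
}


def split_entities(knowledge_value):
    """Hypothetical stand-in for the original, string-only handling."""
    return knowledge_value.get("core_entities", "").split(",")


# The string format works as the original code expected.
print(split_entities(string_payload))

# The dict format fails because dict has no .split() method.
try:
    split_entities(dict_payload)
except AttributeError as exc:
    print(f"AttributeError: {exc}")
```

The second call raises because .get() returns the dict itself, so the default value "" never comes into play.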

✅ Solution

Modified the assemble_knowledge_unit method in knowledge_unit_extractor.py to handle both formats gracefully:

core_entities_raw = knowledge_value.get("core_entities", "")
core_entities = {}  # maps entity name -> entity type

# Handle both string and dict formats for core_entities
if isinstance(core_entities_raw, dict):
    # Dict format: {entity_name: entity_type}
    core_entities = core_entities_raw
elif isinstance(core_entities_raw, str):
    # String format: comma-separated values
    for item in core_entities_raw.split(","):
        if not item.strip():
            continue
        core_entities[item.strip()] = "Others"
else:
    # Handle unexpected types gracefully with logging
    logger.warning(
        f"Unexpected type for core_entities: {type(core_entities_raw)}, "
        f"expected str or dict. Value: {core_entities_raw}"
    )
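
The same branching can also be expressed as a standalone helper, which makes it easy to exercise in isolation. This is a sketch under assumptions, not the code merged in this PR: the function name normalize_core_entities and the default_type parameter are hypothetical, and the dict is copied rather than reused so that callers cannot mutate the LLM payload by accident.

```python
import logging
from typing import Any, Dict

logger = logging.getLogger(__name__)


def normalize_core_entities(raw: Any, default_type: str = "Others") -> Dict[str, str]:
    """Return a {entity_name: entity_type} mapping for either input format.

    Hypothetical helper; the name and signature are not part of the KAG codebase.
    """
    if isinstance(raw, dict):
        # Dict format already maps names to types; return a shallow copy.
        return dict(raw)
    if isinstance(raw, str):
        # String format: comma-separated names, blank items skipped.
        return {
            name.strip(): default_type
            for name in raw.split(",")
            if name.strip()
        }
    # Any other type (None, list, ...) is logged and treated as empty.
    logger.warning(
        "Unexpected type for core_entities: %s, expected str or dict. Value: %r",
        type(raw),
        raw,
    )
    return {}


print(normalize_core_entities("a, b,,c"))
print(normalize_core_entities({"T.I.": "Person"}))
```

Returning an empty mapping for unexpected types keeps extraction running, at the cost of silently dropping entities unless the warning is monitored.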

🧪 Testing

Experiment Scripts: Added test scripts in the experiments/ directory to verify that the fix handles the following scenarios:
- String format (Chinese)
- Dict format (English)
- Empty strings
- Missing fields
- Invalid types (with proper logging)
Unit Tests: Added test_knowledge_unit_core_entities.py, which covers each of the core_entities formats listed above
Code Quality: All changes pass flake8 validation
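
For illustration, the listed scenarios could be covered by tests along the following lines. This sketch does not reproduce the contents of test_knowledge_unit_core_entities.py: parse_core_entities is a hypothetical stand-in for the handling inside assemble_knowledge_unit, and the logger name core_entities_sketch is likewise assumed.

```python
import logging
import unittest

logger = logging.getLogger("core_entities_sketch")  # assumed logger name


def parse_core_entities(knowledge_value):
    """Hypothetical stand-in for the handling in assemble_knowledge_unit."""
    raw = knowledge_value.get("core_entities", "")
    if isinstance(raw, dict):
        return dict(raw)
    if isinstance(raw, str):
        return {item.strip(): "Others" for item in raw.split(",") if item.strip()}
    logger.warning("Unexpected type for core_entities: %s", type(raw))
    return {}


class CoreEntitiesFormatTest(unittest.TestCase):
    def test_string_format(self):
        result = parse_core_entities({"core_entities": "火电发电量,同比增长率,2019年"})
        self.assertEqual(
            result,
            {"火电发电量": "Others", "同比增长率": "Others", "2019年": "Others"},
        )

    def test_dict_format(self):
        entities = {"T.I.": "Person", "No Mediocre": "Culture and Entertainment"}
        self.assertEqual(parse_core_entities({"core_entities": entities}), entities)

    def test_empty_string_and_missing_field(self):
        self.assertEqual(parse_core_entities({"core_entities": ""}), {})
        self.assertEqual(parse_core_entities({}), {})

    def test_invalid_type_logs_warning(self):
        # assertLogs fails the test if no WARNING is emitted on this logger.
        with self.assertLogs("core_entities_sketch", level="WARNING"):
            self.assertEqual(parse_core_entities({"core_entities": ["a", "b"]}), {})
```

Saved as a module, the tests can be run with python -m unittest.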

📝 Changes

Modified: kag/builder/component/extractor/knowledge_unit_extractor.py - Added type checking and handling for both dict and string formats
Added: tests/unit/builder/component/test_knowledge_unit_core_entities.py - Unit tests for the fix
Added: experiments/test_core_entities_handling.py - Experiment script demonstrating the issue and fix
Added: experiments/test_fix.py - Verification script for all scenarios

🔗 Related Issues

Fixes #714

🤖 Generated with Claude Code

Adding CLAUDE.md with task information for AI processing. This file will be removed when the task is complete. Issue: undefined

…n knowledge unit extractor

Fix AttributeError when LLM returns core_entities as dict instead of string. The code now handles both formats:
- String format (Chinese): "entity1,entity2,entity3"
- Dict format (English): {"entity1": "Type1", "entity2": "Type2"}

This resolves the issue where knowledge extraction would fail with:
AttributeError: 'dict' object has no attribute 'split'

Fixes OpenSPG#714

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

This reverts commit acb7361.

unidel2035 · 2025-11-01T16:22:56Z

🤖 Solution Draft Log

This log file contains the complete execution trace of the AI solution draft process.

📎 Log file uploaded as GitHub Gist (359KB)
🔗 View complete solution draft log

The working session has now ended. Feel free to review the solution draft and add any feedback.

unidel2035 and others added 2 commits November 1, 2025 16:12

Initial commit with task details for issue OpenSPG#714

acb7361


unidel2035 changed the title ~~[WIP] Knowledge extraction is failing, how can this be fixed?~~ fix(builder): handle both dict and string formats for core_entities in knowledge unit extractor Nov 1, 2025

unidel2035 mentioned this pull request Nov 1, 2025

Knowledge extraction is failing, how can this be fixed? #714

Open

unidel2035 marked this pull request as ready for review November 1, 2025 16:22

Revert "Initial commit with task details for issue OpenSPG#714"

fbbff98

