ontocast.tool.converter

Document conversion tools for OntoCast.

This module provides functionality for converting various document formats into structured data that can be processed by the OntoCast system.

ConverterTool

Bases: Tool

Tool for converting documents to structured data.

This class provides functionality for converting various document formats into structured data that can be processed by the OntoCast system.

Attributes:

Name | Type | Description
supported_extensions | set[str] | Set of supported file extensions.
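
The `__call__` method shown on this page does not itself check file extensions, so a caller may want to screen inputs against `supported_extensions` first. A minimal sketch, assuming the set values from the class source; `is_supported` is a hypothetical helper, not part of OntoCast:

```python
from pathlib import Path

# Copied from ConverterTool.supported_extensions for illustration.
SUPPORTED_EXTENSIONS = {".pdf", ".ppt", ".pptx"}


def is_supported(path: str) -> bool:
    """Return True if the file suffix (case-insensitive) is in the set."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


pdf_ok = is_supported("slides/report.PDF")
txt_ok = is_supported("notes.txt")
```

Lowercasing the suffix is a design choice of this sketch; the original set contains only lowercase entries.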

Source code in ontocast/tool/converter.py
class ConverterTool(Tool):
    """Tool for converting documents to structured data.

    This class provides functionality for converting various document formats
    into structured data that can be processed by the OntoCast system.

    Attributes:
        supported_extensions: Set of supported file extensions.
    """

    supported_extensions: set[str] = {".pdf", ".ppt", ".pptx"}

    def __init__(
        self,
        **kwargs,
    ):
        """Initialize the converter tool.

        Args:
            **kwargs: Additional keyword arguments passed to the parent class.
        """
        super().__init__(**kwargs)
        self._converter = DocumentConverter()

    def __call__(self, file_input: Union[bytes, str]) -> Dict[str, Any]:
        """Convert a document to structured data.

        Args:
            file_input: The input as raw document bytes, or a string (such as
                plain text) that is returned unchanged.

        Returns:
            Dict[str, Any]: The converted document data.
        """
        if isinstance(file_input, bytes):
            ds = DocumentStream(name="doc", stream=BytesIO(file_input))
            result = self._converter.convert(ds)
            doc = result.document.export_to_markdown()
            return {"text": doc}
        else:
            # For non-bytes input (like plain text), return as is
            return {"text": file_input}

__call__(file_input)

Convert a document to structured data.

Parameters:

Name | Type | Description | Default
file_input | Union[bytes, str] | The input as raw document bytes, or a string (such as plain text) that is returned unchanged. | required

Returns:

Type | Description
Dict[str, Any] | A dictionary whose "text" key holds the Markdown export of the converted document, or the input string unchanged.

Source code in ontocast/tool/converter.py
def __call__(self, file_input: Union[bytes, str]) -> Dict[str, Any]:
    """Convert a document to structured data.

    Args:
        file_input: The input as raw document bytes, or a string (such as
            plain text) that is returned unchanged.

    Returns:
        Dict[str, Any]: The converted document data.
    """
    if isinstance(file_input, bytes):
        ds = DocumentStream(name="doc", stream=BytesIO(file_input))
        result = self._converter.convert(ds)
        doc = result.document.export_to_markdown()
        return {"text": doc}
    else:
        # For non-bytes input (like plain text), return as is
        return {"text": file_input}
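
The branching in `__call__` can be illustrated without installing the conversion backend. The sketch below mirrors the `isinstance` check; `convert_sketch` and `fake_to_markdown` are hypothetical stand-ins, and the real method delegates to `DocumentConverter` instead:

```python
from typing import Any, Callable, Dict, Union


def convert_sketch(
    file_input: Union[bytes, str],
    to_markdown: Callable[[bytes], str],
) -> Dict[str, Any]:
    """Mirror the bytes-versus-str branching of ConverterTool.__call__."""
    if isinstance(file_input, bytes):
        # Bytes are treated as a document and converted.
        return {"text": to_markdown(file_input)}
    # Strings are passed through untouched; a path is NOT opened.
    return {"text": file_input}


def fake_to_markdown(data: bytes) -> str:
    """Hypothetical stand-in for the real conversion step."""
    return f"converted {len(data)} bytes"


from_bytes = convert_sketch(b"%PDF-1.7", fake_to_markdown)
from_path = convert_sketch("report.pdf", fake_to_markdown)
```

Here `from_path` holds the literal string `"report.pdf"`, because string input is never read from disk. To convert a file, read it first (for example with `pathlib.Path("report.pdf").read_bytes()`) and pass the resulting bytes.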

__init__(**kwargs)

Initialize the converter tool.

Parameters:

Name | Type | Description | Default
**kwargs | | Additional keyword arguments passed to the parent class. | {}

Source code in ontocast/tool/converter.py
def __init__(
    self,
    **kwargs,
):
    """Initialize the converter tool.

    Args:
        **kwargs: Additional keyword arguments passed to the parent class.
    """
    super().__init__(**kwargs)
    self._converter = DocumentConverter()
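
The constructor pattern above (forward `**kwargs` to the parent, then build one converter per tool instance) can be sketched with plain classes. `BaseTool`, `StubConverter`, and `SketchConverterTool` are hypothetical stand-ins; how the real `Tool` base class handles its keyword arguments is an assumption here:

```python
class BaseTool:
    """Hypothetical stand-in for the parent Tool class."""

    def __init__(self, **kwargs):
        # Assumption: the parent simply records the options it receives.
        self.options = dict(kwargs)


class StubConverter:
    """Hypothetical stand-in for DocumentConverter."""

    instances = 0  # counts how many converters have been constructed

    def __init__(self):
        StubConverter.instances += 1


class SketchConverterTool(BaseTool):
    def __init__(self, **kwargs):
        # Forward every keyword argument to the parent unchanged.
        super().__init__(**kwargs)
        # Build the converter once so later calls can reuse it.
        self._converter = StubConverter()


tool = SketchConverterTool(name="converter")
```

Creating the converter in `__init__` rather than in `__call__` means repeated conversions share one converter object.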