`graphcast.cli.xml2json`¶

XML to JSON conversion tool for data preprocessing.

This module provides a command-line tool for converting XML files to JSON format, with support for different data sources and chunking options. It's particularly useful for preprocessing scientific literature data from sources like Web of Science and PubMed.

Key Features

Support for Web of Science and PubMed XML formats
Configurable chunking for large files
Batch processing of multiple files
Customizable output format

Example

$ uv run xml2json --source-path data/wos.xml --chunk-size 1000 --mode wos_csv

`do(source_path, chunk_size, max_chunks, mode)` ¶

Convert XML files to JSON format.

This command processes XML files and converts them to JSON format, with support for different data sources and chunking options.

Parameters:

Name	Description	Default
`source_path`	Path to source XML file or directory	required
`chunk_size`	Number of records per output file (default: 1000)	required
`max_chunks`	Maximum number of chunks to process (default: None)	required
`mode`	Data source mode ('wos_csv' or 'pubmed')	required

Example

$ uv run xml2json --source-path data/wos.xml --chunk-size 1000 --mode wos_csv

Source code in graphcast/cli/xml2json.py

@click.command()
@click.option(
    "-s",
    "--source-path",
    type=click.Path(path_type=pathlib.Path),
    required=True,
)
@click.option("-c", "--chunk-size", type=int, default=1000)
@click.option("-m", "--max-chunks", type=int, default=None)
@click.option("--mode", type=str)
def do(source_path, chunk_size, max_chunks, mode):
    """Convert XML files to JSON format.

    This command processes XML files and converts them to JSON format, with support
    for different data sources and chunking options.

    Args:
        source_path: Path to source XML file or directory
        chunk_size: Number of records per output file (default: 1000)
        max_chunks: Maximum number of chunks to process (default: None)
        mode: Data source mode ('wos_csv' or 'pubmed')

    Example:
        $ uv run xml2json \\
            --source-path data/wos.xml \\
            --chunk-size 1000 \\
            --mode wos_csv
    """
    if mode == "wos_csv":
        pattern = r"xmlns=\".*[^\"]\"(?=>)"
        force_list = force_list_wos
        tag = tag_wos
    elif mode == "pubmed":
        pattern = None
        force_list = None
        tag = "PubmedArticle"
    else:
        raise ValueError(f"Unknown mode {mode}")

    if source_path.is_dir():
        files = [
            fp for fp in source_path.iterdir() if not fp.is_dir() and "xml" in fp.name
        ]
    else:
        files = [source_path] if ".xml." in source_path.name else []
    for fp in files:
        target_root = str(fp.parent / fp.name.split(".")[0])

        convert(
            fp,
            target_root=target_root,
            chunk_size=chunk_size,
            max_chunks=max_chunks,
            pattern=pattern,
            force_list=force_list,
            root_tag=tag,
        )

graphcast.cli.xml2json¶

do(source_path, chunk_size, max_chunks, mode) ¶

`graphcast.cli.xml2json`¶

`do(source_path, chunk_size, max_chunks, mode)` ¶