A Python library for streaming tag search in partially loaded XML from a website

The HTTP-Stream-XML library lets developers parse XML from HTTP responses as a stream, for example when the server uses chunked transfer encoding. Instead of waiting for the entire response to arrive, the library parses the data chunk by chunk as it comes in.
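To make the chunk-by-chunk idea concrete, here is a minimal sketch that uses XMLPullParser from Python's standard library as a stand-in. It is not how HTTP-Stream-XML is implemented internally; the function name closed_tags_per_chunk and the sample chunks are made up for illustration. It shows that chunk boundaries may fall in the middle of a tag and that elements are still reported as soon as their closing tag has arrived.

```python
from xml.etree.ElementTree import XMLPullParser


def closed_tags_per_chunk(chunks):
    """Return, for each chunk, the tags whose closing tag arrived with it."""
    # Only "end" events matter here: an element is usable once it is closed.
    parser = XMLPullParser(events=("end",))
    progress = []
    for chunk in chunks:
        parser.feed(chunk)  # the parser buffers incomplete markup itself
        progress.append([elem.tag for _event, elem in parser.read_events()])
    return progress


# Hypothetical response body, split at arbitrary points (even inside a tag).
CHUNKS = [b"<root><a>1</", b"a><b>2</b><c>", b"3</c></root>"]

progress = closed_tags_per_chunk(CHUNKS)
for number, tags in enumerate(progress, start=1):
    print(f"chunk {number}: closed tags {tags}")
```

Nothing is reported for the first chunk because no element has been closed yet; later chunks report elements in document order.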

This strategy is useful for large HTTP responses, particularly those served by slow government websites.

If the essential data tags are located at the beginning of the XML file, HTTP-Stream-XML allows you to start processing data as soon as it’s received, potentially saving substantial waiting time.

Rather than waiting for a slow server to send the entire file, you can extract the data you need and complete your data processing tasks faster.

Installation

pip install http-stream-xml

A Practical Illustration of HTTP-Stream-XML

The following get_gene_info function demonstrates how to use the HTTP-Stream-XML library to retrieve gene information from the NCBI Entrez API, which provides access to NCBI's large collection of biomedical databases.

"""Illustrates usage of streamed XML (partial) parsing.

Gets gene's info from NCBI entrez API (PubMed)
https://www.ncbi.nlm.nih.gov/
"""
from typing import Union

from http_stream_xml.entrez import requests_retry_session
from http_stream_xml.xml_stream import XmlStreamExtractor


def get_gene_info(gene_id: Union[str, int]) -> XmlStreamExtractor:
    """Get gene's info from NCBI entrez API (PubMed)."""
    extractor = XmlStreamExtractor(["Gene-ref_desc", "Entrezgene_summary", "Gene-ref_syn"])

    host = "eutils.ncbi.nlm.nih.gov"
    url = f"/entrez/eutils/efetch.fcgi?db=gene&id={gene_id}&retmode=xml"
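    # stream=True defers downloading the body until it is iterated over.
    # Note: verify=False disables TLS certificate verification; remove it
    # unless the server's certificate cannot be validated.
    response = requests_retry_session().get(
        f"https://{host}{url}", stream=True, verify=False, timeout=60
    )

    fetched_bytes = 0
    for line in response.iter_lines(chunk_size=1024):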
        if line:
            fetched_bytes += len(line)
            extractor.feed(line)
            if extractor.extraction_completed:
                break
        print(f"fetched {fetched_bytes} bytes, found tags {extractor.tags.keys()}")

    return extractor


if __name__ == "__main__":
    extractor = get_gene_info("5465")
    print(f"\nResult: {extractor.tags.keys()}\n\n{extractor.tags}")

The get_gene_info function uses HTTP-Stream-XML to retrieve and parse the XML response from the NCBI API. It sends an HTTP GET request with stream=True and reads the response line by line. Each line is fed to the XmlStreamExtractor, and the loop breaks as soon as the extractor reports that extraction is complete, so the rest of the response is not read.
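The same feed-and-break pattern can be tried without a network connection. The sketch below again relies on the standard library's XMLPullParser rather than on XmlStreamExtractor, so it illustrates the technique and not the library's API. The names extract_tags and iter_chunks, the 64-byte chunk size, and the sample document are all assumptions made for this example.

```python
from xml.etree.ElementTree import XMLPullParser


def iter_chunks(data, size):
    """Yield successive slices of data, mimicking a streamed response body."""
    for start in range(0, len(data), size):
        yield data[start:start + size]


def extract_tags(chunks, wanted):
    """Feed chunks to a pull parser and stop once every wanted tag is found.

    Returns the extracted texts and the number of bytes consumed.
    """
    wanted = set(wanted)
    parser = XMLPullParser(events=("end",))
    found = {}
    consumed = 0
    for chunk in chunks:
        parser.feed(chunk)
        consumed += len(chunk)
        for _event, elem in parser.read_events():
            if elem.tag in wanted and elem.tag not in found:
                found[elem.tag] = (elem.text or "").strip()
        if len(found) == len(wanted):
            break  # everything we need has arrived; skip the rest
    return found, consumed


# Invented sample: the interesting tags come first, followed by bulk data.
SAMPLE = (
    b"<Entrezgene><Gene-ref_desc>example description</Gene-ref_desc>"
    b"<Entrezgene_summary>example summary</Entrezgene_summary>"
    + b"<Filler>x</Filler>" * 1000
    + b"</Entrezgene>"
)

found, consumed = extract_tags(
    iter_chunks(SAMPLE, 64), ["Gene-ref_desc", "Entrezgene_summary"]
)
print(f"found {found} after {consumed} of {len(SAMPLE)} bytes")
```

Because the wanted tags sit at the start of the sample, the loop stops after a small fraction of the document. Over a slow connection, that fraction is what determines how long you wait.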

How much time this saves depends on where the requested tags sit: the closer they are to the beginning of the XML, the sooner the loop can stop.

Wrapping Up

The HTTP-Stream-XML library for Python is a useful tool for extracting data from large XML responses served by slow servers.

It offers a significant advantage in situations where the critical data is located at the beginning of the XML file, reducing the waiting time for data retrieval.

Source Code

http-stream-xml