A Python library for streaming tag search in partially loaded XML from a website
The HTTP-Stream-XML library lets developers parse XML in HTTP responses in a streaming manner (as with chunked transfer encoding). Instead of waiting for the entire response to arrive, the library parses the data chunk by chunk as it comes in.
This strategy comes in handy when dealing with large HTTP responses, particularly those served by slow government sites.
If the essential data tags are located at the beginning of the XML file, HTTP-Stream-XML allows you to start processing data as soon as it’s received, potentially saving substantial waiting time.
Rather than waiting for a slow server to send the entire file, you can extract the data you need and complete your data processing tasks faster.
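The idea of stopping early can be sketched with nothing but the standard library. The block below is not how HTTP-Stream-XML is implemented; it swaps in Python's xml.etree.ElementTree.XMLPullParser as a stand-in, and the function name find_tags_streaming, the sample document, and the 64-byte chunk size are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Iterable, Set, Tuple


def find_tags_streaming(
    chunks: Iterable[bytes], wanted: Set[str]
) -> Tuple[Dict[str, str], int]:
    """Feed chunks to an incremental parser and stop once every wanted tag has closed.

    Hypothetical helper for illustration only; not part of HTTP-Stream-XML.
    Returns the text of each wanted tag and the number of bytes consumed.
    """
    parser = ET.XMLPullParser(events=("end",))
    found: Dict[str, str] = {}
    consumed = 0
    for chunk in chunks:
        consumed += len(chunk)
        parser.feed(chunk)
        # Collect every element that was fully closed by the data seen so far.
        for _event, element in parser.read_events():
            if element.tag in wanted and element.tag not in found:
                found[element.tag] = (element.text or "").strip()
        if len(found) == len(wanted):
            break  # the remaining chunks are never read
    return found, consumed


# Simulated response: the interesting tags come first, followed by bulk data.
document = (
    b"<record><name>alpha</name><summary>nuclear receptor</summary>"
    + b"<filler>x</filler>" * 1000
    + b"</record>"
)
chunks = [document[i:i + 64] for i in range(0, len(document), 64)]

tags, consumed = find_tags_streaming(chunks, {"name", "summary"})
print(f"found {tags} after {consumed} of {len(document)} bytes")
```

Because both wanted tags close near the start of the simulated document, the loop exits after reading a small fraction of it. With a real HTTP response, the chunks would come from the network instead of a list.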
Installation
pip install http-stream-xml
A Practical Illustration of HTTP-Stream-XML
The following get_gene_info function demonstrates how to use the HTTP-Stream-XML library to retrieve gene information from the NCBI Entrez API, which serves PubMed and NCBI's other biomedical databases.
"""Illustrates usage of streamed XML (partial) parsing.

Gets gene's info from NCBI Entrez API (PubMed)
https://www.ncbi.nlm.nih.gov/
"""
from typing import Union

from http_stream_xml.entrez import requests_retry_session
from http_stream_xml.xml_stream import XmlStreamExtractor


def get_gene_info(gene_id: Union[str, int]) -> XmlStreamExtractor:
    """Get gene's info from NCBI Entrez API (PubMed)."""
    # Tags to look for in the streamed XML.
    extractor = XmlStreamExtractor(["Gene-ref_desc", "Entrezgene_summary", "Gene-ref_syn"])
    host = "eutils.ncbi.nlm.nih.gov"
    url = f"/entrez/eutils/efetch.fcgi?db=gene&id={gene_id}&retmode=xml"
    # stream=True defers downloading the body until it is iterated over.
    # NOTE: verify=False disables TLS certificate verification;
    # prefer the default (verify=True) where possible.
    response = requests_retry_session().get(
        f"https://{host}{url}", stream=True, verify=False, timeout=60
    )
    fetched_bytes = 0
    for line in response.iter_lines(chunk_size=1024):
        if line:
            fetched_bytes += len(line)
            extractor.feed(line)
            # Stop reading as soon as the extractor reports it is done.
            if extractor.extraction_completed:
                break
    print(f"fetched {fetched_bytes} bytes, found tags {extractor.tags.keys()}")
    return extractor


if __name__ == "__main__":
    extractor = get_gene_info("5465")
    print(f"\nResult: {extractor.tags.keys()}\n\n{extractor.tags}")
The get_gene_info function uses HTTP-Stream-XML to retrieve and parse the XML response from the NCBI API. It issues an HTTP GET request with streaming enabled and feeds each incoming line to the XmlStreamExtractor. The loop breaks as soon as the extractor reports that extraction is complete, so the rest of the response is never read.
This approach can cut waiting time considerably, especially when the tags you need lie near the beginning of the XML data.
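One practical detail when breaking out of a streamed response early: with stream=True, requests keeps the connection open until the body is fully consumed or the response is closed. The sketch below wraps the same feed loop in contextlib.closing so the response is closed even on an early break. The helper name feed_until_complete is an assumption for illustration; it only relies on the response having iter_lines() and close(), and on the extractor having feed() and extraction_completed, as in the example above.

```python
from contextlib import closing


def feed_until_complete(response, extractor, chunk_size=1024):
    """Feed streamed lines to the extractor, stop early, and always close the response.

    Hypothetical helper, not part of HTTP-Stream-XML. `response` is assumed
    to expose iter_lines() and close() (as requests.Response does);
    `extractor` is assumed to expose feed() and extraction_completed.
    Returns the number of bytes fed to the extractor.
    """
    fetched_bytes = 0
    with closing(response):  # close() runs on normal exit, break, or exception
        for line in response.iter_lines(chunk_size=chunk_size):
            if not line:
                continue  # skip empty lines
            fetched_bytes += len(line)
            extractor.feed(line)
            if extractor.extraction_completed:
                break
    return fetched_bytes
```

Inside get_gene_info, the for loop could then be replaced by a single call that passes in the streamed response and the extractor.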
Wrapping Up
The HTTP-Stream-XML library for Python is a useful tool for extracting data from large XML responses served by slow servers.
Its main advantage shows when the critical data sits at the beginning of the XML file: you can stop reading as soon as the tags you need have arrived instead of waiting for the full download.