How to parse only a small part of the XML from the NCBI (PubMed) Entrez API

NCBI Entrez problem

For my project I need general gene information from the NCBI database.

The problem is that the Entrez API returns huge responses, many megabytes each.

The gene summary I need is at the very beginning, so I want only the beginning of the response.

How can we download just the first part of the XML and extract information from this incomplete, and therefore invalid, XML?

With Python's xml.sax we can register a content handler that handles the XML tags on the fly, while the XML is still being parsed:

import xml.sax

parser = xml.sax.make_parser()
parser.setContentHandler(your_stream_handler)  # an instance of your xml.sax.ContentHandler subclass

For an example of such a handler, see xml_stream.py.
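To make the idea concrete, here is a minimal sketch of such a handler. It is not the code from xml_stream.py: the class name FirstTagText and the tag names Gene and Summary are made up for illustration, and the real Entrez tag names may differ.

```python
import xml.sax


class FirstTagText(xml.sax.ContentHandler):
    """Collect the text of the first element named `tag` (hypothetical helper)."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.text = None     # filled in once the closing tag has been seen
        self._parts = None   # text pieces collected while inside the element

    def startElement(self, name, attrs):
        if name == self.tag and self.text is None:
            self._parts = []

    def characters(self, content):
        # The parser may deliver the text of one element in several pieces.
        if self._parts is not None:
            self._parts.append(content)

    def endElement(self, name):
        if name == self.tag and self._parts is not None:
            self.text = "".join(self._parts)
            self._parts = None


# Illustrative document, not a real Entrez response.
handler = FirstTagText("Summary")
xml.sax.parseString(b"<Gene><Summary>motor protein</Summary></Gene>", handler)
print(handler.text)  # motor protein
```

The handler only remembers the first matching element, which is enough when the field we need appears once near the top of the response.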

And we can feed the XML to xml.sax chunk by chunk:

parser = xml.sax.make_parser()
parser.feed(chunk)  # call again for every new chunk as it arrives
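The sketch below shows why this helps with partial XML: the handler is called as soon as each tag has been fed, so we can stop before the parser ever notices that the document is cut off. The document, the tag names and the SeenTags class are all invented for this example.

```python
import xml.sax


class SeenTags(xml.sax.ContentHandler):
    """Record the name of every element as soon as its closing tag arrives."""

    def __init__(self):
        super().__init__()
        self.closed = []

    def endElement(self, name):
        self.closed.append(name)


# A made-up response that is cut off in the middle, so it is not valid XML.
partial = b"<Gene><Id>123</Id><Summary>motor protein</Summary><Comments><Comment>x"
chunks = [partial[i:i + 16] for i in range(0, len(partial), 16)]

handler = SeenTags()
parser = xml.sax.make_parser()
parser.setContentHandler(handler)

fed = 0
for chunk in chunks:
    parser.feed(chunk)
    fed += 1
    if "Summary" in handler.closed:
        break  # we have what we need; the remaining chunks are never parsed

print(handler.closed, "after", fed, "of", len(chunks), "chunks")
```

Note that the sketch never calls parser.close(). As far as I can tell, close() tells the parser the document is finished, and on truncated input like this it would report an error.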

OK, that covers the parsing part.

What about loading? In general, an HTTP server may support chunked transfer encoding.

But even if it does not, we can simply disconnect as soon as we have all the data we need. For example, requests supports streaming:

import requests

r = requests.get('https://httpbin.org/stream/20', stream=True)
for line in r.iter_lines():
    print(line)
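Putting both halves together could look roughly like this. It is only a sketch under a few assumptions: the helper names first_tag_text and fetch_first_tag, the _TagText class and the Summary tag are made up, and the URL and tag you pass in depend on the Entrez request you actually make.

```python
import xml.sax


class _TagText(xml.sax.ContentHandler):
    """Keep the text of the first element named `tag` (hypothetical helper)."""

    def __init__(self, tag):
        super().__init__()
        self.tag, self.text, self._parts = tag, None, None

    def startElement(self, name, attrs):
        if name == self.tag and self.text is None:
            self._parts = []

    def characters(self, content):
        if self._parts is not None:
            self._parts.append(content)

    def endElement(self, name):
        if name == self.tag and self._parts is not None:
            self.text, self._parts = "".join(self._parts), None


def first_tag_text(chunks, tag):
    """Feed byte chunks to the parser and stop as soon as `tag` is complete."""
    handler = _TagText(tag)
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    for chunk in chunks:
        parser.feed(chunk)
        if handler.text is not None:
            break  # stop pulling data from the source
    return handler.text


def fetch_first_tag(url, tag, chunk_size=1024):
    """Stream `url` and disconnect once `tag` has been read."""
    import requests  # third-party; imported here so first_tag_text works without it

    # Leaving the `with` block closes the connection, so the rest of the
    # response body is never downloaded.
    with requests.get(url, stream=True, timeout=30) as response:
        response.raise_for_status()
        return first_tag_text(response.iter_content(chunk_size=chunk_size), tag)
```

I use iter_content rather than iter_lines here because the parser does not care about line boundaries, and a fixed chunk size keeps each read small.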

Now you can write all of this code yourself, or use http-stream-xml:

from httpstreamxml import entrez

print(entrez.genes['myo5b'][entrez.GeneFields.description])
Written on July 5, 2019