How to parse only a small part of the XML from the NCBI (PubMed) Entrez API
NCBI Entrez problem
For my project I need general gene information from the NCBI database.
The problem is that the Entrez API returns really huge responses, many megabytes in size.
The gene summary I need is at the very beginning, so I want only the beginning of the response.
How can we fetch just the first part of the XML and extract information from this invalid, partial XML?
With Python's xml.sax we can register a content handler that handles every XML tag on the fly, while the XML is still being parsed.
For an example of such a handler, see xml_stream.py.
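As a rough illustration of the idea (not the code from xml_stream.py), here is a minimal sketch of a content handler that captures the text of one element. The tag name `Summary` and the sample document are assumptions made up for this example; the real Entrez response uses its own tag names.

```python
import xml.sax


class SummaryHandler(xml.sax.ContentHandler):
    """Collects the text of the first <Summary> element (hypothetical tag name)."""

    def __init__(self):
        super().__init__()
        self.in_summary = False
        self.parts = []
        self.summary = None

    def startElement(self, name, attrs):
        # Start capturing only for the first matching element.
        if name == "Summary" and self.summary is None:
            self.in_summary = True

    def characters(self, content):
        # May be called several times for one text node, so accumulate.
        if self.in_summary:
            self.parts.append(content)

    def endElement(self, name):
        if name == "Summary" and self.in_summary:
            self.in_summary = False
            self.summary = "".join(self.parts)


# Made-up sample document, only for demonstration.
SAMPLE = b"<Gene><Summary>Encodes a myosin.</Summary><Other>ignored</Other></Gene>"

handler = SummaryHandler()
xml.sax.parseString(SAMPLE, handler)
print(handler.summary)
```

The handler never sees the document as a whole. It only reacts to events, which is what makes it usable on a stream.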
We can also feed the XML to xml.sax chunk by chunk:
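A minimal sketch of chunked feeding, assuming a handler like the one described above. The names `FirstTagText` and `parse_prefix`, the tag `Summary`, and the sample data are hypothetical. The parser returned by `xml.sax.make_parser()` accepts data through `feed()`, so we can stop as soon as the handler has what it needs. We deliberately never call `close()`, because the document is incomplete and closing would raise a parse error.

```python
import xml.sax


class FirstTagText(xml.sax.ContentHandler):
    """Captures the text of the first element named `tag`."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.buffer = None  # list of text pieces while inside the element
        self.text = None    # final text once the element has closed

    def startElement(self, name, attrs):
        if name == self.tag and self.text is None:
            self.buffer = []

    def characters(self, content):
        if self.buffer is not None:
            self.buffer.append(content)

    def endElement(self, name):
        if name == self.tag and self.buffer is not None:
            self.text = "".join(self.buffer)
            self.buffer = None


def parse_prefix(chunks, tag):
    """Feed chunks to the parser until the first `tag` element is complete."""
    handler = FirstTagText(tag)
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    for chunk in chunks:
        parser.feed(chunk)
        if handler.text is not None:
            break  # we have the data; the rest of the XML is never parsed
    return handler.text


# Made-up truncated document: the root element is never closed.
PARTIAL = b"<Gene><Summary>Encodes a myosin.</Summary>" + b"<Other>x</Other>" * 50
chunks = (PARTIAL[i:i + 32] for i in range(0, len(PARTIAL), 32))
print(parse_prefix(chunks, "Summary"))
```

Because parsing stops at the first complete match, it does not matter that the input is not well-formed XML as a whole.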
OK, that covers the parsing part.
What about loading? In general, an HTTP server may support chunked transfer encoding.
But even if it does not, we can simply disconnect as soon as we have all the data we need. For example, requests supports streaming:
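A minimal sketch of the early disconnect, assuming we can recognize "enough data" by a byte marker such as a closing tag. The helper names `read_until` and `fetch_prefix` and the marker are hypothetical. With `stream=True`, requests defers downloading the body until we iterate over it with `iter_content()`, and `close()` drops the connection.

```python
import requests


def read_until(response, marker, chunk_size=1024):
    """Read `response` in chunks and stop once `marker` has been seen."""
    received = bytearray()
    try:
        for chunk in response.iter_content(chunk_size=chunk_size):
            received.extend(chunk)
            # Search the whole buffer so a marker split across chunks is found.
            if marker in received:
                break
    finally:
        # Drop the connection without downloading the rest of the body.
        response.close()
    return bytes(received)


def fetch_prefix(url, marker, params=None, timeout=30):
    """Download only the beginning of the response, up to `marker`."""
    # stream=True: the body is not fetched until we iterate over it.
    response = requests.get(url, params=params, stream=True, timeout=timeout)
    response.raise_for_status()
    return read_until(response, marker)
```

In a real pipeline, each chunk would go straight into the SAX parser instead of a buffer, and the loop would stop when the content handler reports that it has everything it needs.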
You can write all of this code yourself, or use http-stream-xml: