In general, we recommend fetching the full dataset you want in one go. Because the data is streamed to you, you can read the stream line-by-line and thereby avoid memory issues. However, if that isn't suitable for your use case, it is possible to paginate results using the limit query parameter in conjunction with the offset query parameter. This makes it possible to fetch the whole dataset across multiple requests.
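To make the recommended approach concrete, here is a minimal sketch of streaming the full dataset line-by-line. It assumes the same company-composite endpoint used below, an auth object set up earlier, and a hypothetical process function standing in for whatever you want to do with each entry:

import requests

response = requests.get(
    "https://api.enin.ai/datasets/v1/dataset/company-composite",
    params={"response_file_type": "jsonl"},
    auth=auth,  # assumed to be the auth object set up earlier
    stream=True,  # do not buffer the whole response body in memory
)
for line in response.iter_lines():
    if line:  # iter_lines can yield empty keep-alive lines
        process(line.decode())  # `process` is a hypothetical placeholder

Since each jsonl line is a complete entry, this handles one entry at a time no matter how large the full dataset is.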
There is one inherent risk when using pagination: there is no guarantee that the dataset will remain unchanged while you paginate through the results. However, as long as no data manipulation operations, i.e., inserts, updates, or deletes, happen while you paginate, the results should be consistent.
Furthermore, if you apply an offset query parameter, the dataset must be sorted to give consistent results. This will slow down your request even if you do not specify a sort order (in which case sorting is done on the primary key of the dataset you're requesting).
Note: Not all Datasets API endpoints support the limit and offset query parameters, or they may support them in unexpected ways. As of May 2020, this limitation applies to the company-flag and company-flag-composite endpoints of the Datasets API. These two endpoints do not support the offset query parameter at all, and support the limit query parameter only partially. In fact, the limit parameter is not even documented in the automatic API documentation, as it is not intended for production systems, only for testing purposes. It will limit the number of returned entries, but may return up to a fixed multiple of the requested limit due to technical limitations.
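If you do use the limit parameter against one of these endpoints, e.g., for testing, one defensive option is to enforce the limit client-side. A minimal sketch, assuming the company-flag endpoint lives under the same /datasets/v1/dataset/ path as the other endpoints:

import requests

limit = 10
response = requests.get(
    "https://api.enin.ai/datasets/v1/dataset/company-flag",  # assumed URL pattern
    params={"response_file_type": "jsonl", "limit": limit},
    auth=auth,  # assumed to be the auth object set up earlier
)
# The endpoint may return up to a fixed multiple of `limit`,
# so truncate to the requested size client-side.
entries = response.content.decode().splitlines()[:limit]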
Alright, enough caveats. Let's paginate over all companies with an organization number starting with 9175406. We can do this using the LIKE dot notation filter operator.
import requests

# `auth` is assumed to be the HTTP auth object set up earlier.
companies = None
offset = 0
limit = 6
# Keep requesting pages until an empty page comes back.
while companies is None or companies:
    print("From entry", offset, "to entry", offset + limit)
    response = requests.get(
        "https://api.enin.ai/datasets/v1/dataset/company-composite",
        params={
            "response_file_type": "jsonl",
            "company.org_nr_schema": "EQ:NO",
            "company.org_nr": "LIKE:9175406%",
            "keep_only_fields": "company.uuid,company.name,company.org_nr",
            "order_by_fields": "company.name",
            "limit": limit,
            "offset": offset,
        },
        auth=auth,
    )
    companies = response.content.decode()
    print(companies)
    # Anything other than 200 probably means a problem; avoid looping forever.
    if response.status_code != 200:
        break
    offset += limit
Notice how we increment the offset by the limit size. This ensures you get disjoint sets of data. Also, notice that we break out of the loop if we get anything other than a 200 HTTP status code; in that case there is probably a problem, and we might otherwise enter an infinite loop. Finally, notice how we sort by company.name. This isn't actually enough to ensure consistent ordering, so under the hood another ordering is added on company.uuid, as it is the primary key of the subject entity, company.
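If you would rather not rely on this implicit tiebreaker, you can presumably make the ordering explicit yourself. This sketch assumes order_by_fields accepts a comma-separated list of fields, mirroring keep_only_fields above:

import requests

response = requests.get(
    "https://api.enin.ai/datasets/v1/dataset/company-composite",
    params={
        "response_file_type": "jsonl",
        "company.org_nr_schema": "EQ:NO",
        "company.org_nr": "LIKE:9175406%",
        "keep_only_fields": "company.uuid,company.name,company.org_nr",
        # Assumed: a comma-separated list makes the uuid tiebreaker explicit.
        "order_by_fields": "company.name,company.uuid",
        "limit": 6,
        "offset": 0,
    },
    auth=auth,  # assumed to be the auth object set up earlier
)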
The pagination loop runs three times: twice with data, and once with an empty result:
From entry 0 to entry 6
{"company": {"name": "ANNE BIRGITTE NESSE", "uuid": "1ca93ab6-8507-488e-b903-888a4cf22b12", ...
{"company": {"name": "ÅRNES KORNSILO & MØLLE BA", "uuid": "10b32f80-c30f-443a-bc33-528b112e ...
{"company": {"name": "ENIN AS", "uuid": "f0e01f1e-f977-429f-ac12-0f0418901bfe", "org_nr": " ...
{"company": {"name": "HANNA BERGH", "uuid": "fcbde1ab-db34-4757-ac66-85c2ccd7db2b", "org_nr ...
{"company": {"name": "LINDA DAHL INVEST AS", "uuid": "7f71497e-e02c-459c-ae09-1463178bdfca" ...
{"company": {"name": "LOOMIAN ENTERTAINMENT BERG", "uuid": "5dc21c97-25ac-46a8-91d5-a9d9acc ...
From entry 6 to entry 12
{"company": {"name": "NP CONSULT SP.Z O.O", "uuid": "d3141183-eeff-4dbe-84af-25cc15876f6e", ...
{"company": {"name": "PEARL OF THE ORIENT ANABELLE DELOS SANTOS TVERÅ", "uuid": "defdb0b1-f ...
{"company": {"name": "TØRRES AVIATION", "uuid": "5faf1564-43d6-46ba-95d0-9989fa3c3df1", "or ...
From entry 12 to entry 18
When the result is empty, the loop exits, and we are done.
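Finally, if you want the entries as Python dictionaries rather than raw text, each jsonl line can be parsed individually with the standard json module. A minimal sketch, assuming a hypothetical pages list holding the non-empty response bodies collected in the loop above:

import json

# `pages` is a hypothetical list of the non-empty jsonl response bodies,
# e.g., built by appending `companies` on each loop iteration above.
entries = [
    json.loads(line)
    for page in pages
    for line in page.splitlines()
    if line
]
for entry in entries:
    print(entry["company"]["name"], entry["company"]["org_nr"])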