
# Day 2: Migrations and PostGIS queries

## The (failed) data migration of the 50 free toilets

My plan was to perform a data migration to update the price column on my toilets table if a toilet was included in the dataset of 50 free-of-charge toilets. I'm skipping this for now.

The reason is that the dataset does not contain toilet IDs. I made an attempt to pattern match my `toilets.address` column against the `Standort` column in the dataset. However, the `Standort` entries match the address column only very irregularly, so my attempt at pattern matching in `03-migrate-free-toilets.sql` was not successful. In an ideal scenario, the toilet IDs would be provided. I might come back to the pattern matching approach at a later point (just for the exercise), but it is not a priority.
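The general shape of the idea looks something like this (a hypothetical sketch, not the actual contents of `03-migrate-free-toilets.sql`; `free_toilets` is an assumed staging table holding the dataset of 50 free toilets):

```sql
-- Sketch: mark a toilet as free when its address loosely matches
-- a Standort entry from the free-toilets dataset.
-- 'free_toilets' and its 'standort' column are assumed names.
UPDATE toilets
SET price = 0
FROM free_toilets
WHERE toilets.address ILIKE '%' || free_toilets.standort || '%';
```

As described above, the `Standort` strings are too irregular for this kind of substring matching to work reliably.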

## Migrations and schemas - some notes

The migra tool has a little explainer about the two approaches to the source of truth for a database schema: keeping it in the database itself, or with the application.

My takeaway is that (for my context) I find it preferable to leave the schema truth with the app. I would usually not manage a database without an application that uses it, so the two would be very intertwined anyway.

A typical application for me would be a Rails app, which comes with a simple and mature ORM (Active Record) for handling migrations and the schema.

Another interesting tool for handling schemas is the built-in Schema Diff tool of pgAdmin.

## PostGIS exercises

Next, we want to take a look at the capabilities of PostGIS. PostGIS's website has a nice section of spatial exercises demonstrating different functions. For the next part I will take inspiration from there.

Here are some questions I would like to answer using PostGIS:

Thinking about these exercises and going through the PostGIS examples, I am realizing that I need more tables with spatial data, so that the exercises become more interesting.

In order to write some more interesting queries, we are first going to import more data. I chose to use part of the LOR data of Berlin (for compliance with the CC-BY-3.0 license: the data originates from Amt für Statistik Berlin-Brandenburg). I have chosen this dataset for no particular reason; it's just to have some polygons in my PostGIS database. The data is available as GeoJSON, so this will be another challenge: How can we import data from a GeoJSON file into a PostGIS table?

## Importing the LOR GeoJSON

Okay, I thought this would be harder. It's not!

There is a command-line tool called `ogr2ogr` which (given access to the PostGIS database) is able to seamlessly import a GeoJSON file into a PostGIS table.

I have used the following command to import the file's contents:

```shell
ogr2ogr -f "PostgreSQL" PG:"dbname=berlin_toilets_app user=<me>" /path/to/repo/assets/data/lor_planungsraeume_2021.geojson -nln lors
```

`-nln` sets the name of the table that should be created for the data.

Now, this imported the data with some not-so-ideal column names, but for the purpose of this exploration I'm going to ignore that. (The type detection by `ogr2ogr` is great, by the way.)

## The challenge of different projections

PostGIS has great documentation on the different types of map projections. If we inspect our spatial data, we will notice an issue with our projections:

```sql
SELECT ST_SRID(geometry) FROM toilets LIMIT 1;
-- Returns: 4326
SELECT ST_SRID(wkb_geometry) FROM lors LIMIT 1;
-- Returns: 25833
```

Our LOR data does not use the same projection as our toilet data. How can we deal with that issue?

Luckily, PostGIS gives us the `ST_Transform` function, which allows us to transform a geometry from one projection to another.

We are not going to touch the original projection in the `lors` table; instead, we're going to transform the projection on the fly.
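As a quick sanity check (using the tables from above), we can transform the `lors` geometries on the fly and confirm the SRID of the result:

```sql
-- Transform the lors geometry to WGS 84 (EPSG:4326) on the fly
-- and check which SRID the transformed geometry carries:
SELECT ST_SRID(ST_Transform(wkb_geometry, 4326)) FROM lors LIMIT 1;
-- Returns: 4326
```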

### Finding the LOR with the highest toilet count

```sql
-- Note that I'm using ST_Transform here so that pgAdmin is able to map
-- the results onto its standard 4326 projection map
SELECT lors.plr_name, COUNT(toilets.*), ST_Transform(lors.wkb_geometry, 4326)
FROM lors
JOIN toilets ON ST_Contains(ST_Transform(lors.wkb_geometry, 4326), toilets.geometry::geometry)
GROUP BY lors.plr_name, lors.wkb_geometry
ORDER BY COUNT(toilets.*) DESC
LIMIT 10;

-- "Olympiagelände"          13
-- "Großer Tiergarten"        6
-- "Krumme Lanke"             5
-- "Carl-Schurz-Straße"       5
-- "Nikolassee"               5
-- "Wannsee"                  5
-- "Reichenberger Straße Ost" 5
-- "Alt-Tegel"                5
-- "Barnimkiez"               4
-- "Grünau"                   4
```

Berlin LORs with the highest toilet counts

With this query we can see that the LOR Olympiagelände has by far the most toilets.

It is also interesting that the south-western LORs have high toilet counts as well.

Notice that this is a rather expensive query and takes a while to complete. Is there a way that we can speed this up (e.g. with an index)?
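One idea (untried here, so treat it as a sketch; the index names are made up) is to add spatial GiST indexes. Since the join transforms `lors.wkb_geometry` on the fly, a plain index on that column would not help the `ST_Contains` check, but PostGIS allows indexing the transformed geometry directly via a functional index:

```sql
-- Index the toilets geometry; the query casts it to geometry, so we
-- index the cast expression (assuming the column is stored as geography,
-- as the cast in the query suggests):
CREATE INDEX idx_toilets_geometry ON toilets USING GIST ((geometry::geometry));

-- Functional index on the transformed lors geometry, so that
-- ST_Contains(ST_Transform(lors.wkb_geometry, 4326), ...) can use it:
CREATE INDEX idx_lors_geometry_4326 ON lors USING GIST (ST_Transform(wkb_geometry, 4326));
```

Alternatively, storing a permanently transformed geometry column on one of the tables would avoid the per-row transform entirely.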

Interestingly, calculating the area of a LOR with `ST_Area` almost matches its `groesse_m2` value (since EPSG:25833 is a projected coordinate system with meters as its unit, `ST_Area` returns square meters here). I wonder where the small difference is coming from?

```sql
SELECT groesse_m2, ST_Area(wkb_geometry) FROM lors LIMIT 1;
-- 2294021.723  2292754.92056915
```

### Returning a GeoJSON representation of our query results

It might be interesting to work with the query result of the LORs with the highest toilet count outside of SQL. For that, we can use the `ST_AsGeoJSON` function.

```sql
SELECT json_build_object(
    'type', 'FeatureCollection',
    'features', json_agg(ST_AsGeoJSON(t.*)::json)
)
FROM (
    SELECT lors.plr_name, COUNT(toilets.*), ST_Transform(lors.wkb_geometry, 4326)
    FROM lors
    JOIN toilets ON ST_Contains(ST_Transform(lors.wkb_geometry, 4326), toilets.geometry::geometry)
    GROUP BY lors.plr_name, lors.wkb_geometry
    ORDER BY COUNT(toilets.*) DESC
    LIMIT 10
) AS t(name, toilet_count, geom);
```

Note that we simply wrap our previous query inside an outer query that builds a GeoJSON FeatureCollection object from the rows.


We continue on day 3.