N2000 3 – Getting Data
To analyze the data from Natura 2000, the first step is to get the data from the Natura 2000 page (see the sources listed at the end of this article). When acquiring data from any source, keep in mind that everything you do should be transparent and traceable. You can achieve this by documenting each step properly, by writing an easy-to-read script that handles the steps involved, or both. Why is it important to ensure transparency and traceability? The most important reasons are to be able to prove that the basis for your analysis comes from the data sources you say it comes from, and to be able to spot possible errors in the analysis.
A great primer on the topics in this article is the Coursera course Getting and Cleaning Data from Johns Hopkins University.
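One way to make a download traceable is to store a small manifest next to the data that records, for every file, where it came from, when it was fetched and a checksum of its contents. The following is a minimal Python sketch of that idea, not the script used in this project. The function names and the manifest layout are my own assumptions.

```python
from __future__ import annotations

import hashlib
import json
from datetime import date
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks to spare memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder: Path, source_url: str, downloaded_on: date) -> list[dict]:
    """Describe every file below `folder`: origin, download date, size and checksum."""
    entries = []
    for path in sorted(p for p in folder.rglob("*") if p.is_file()):
        entries.append({
            "file": path.relative_to(folder).as_posix(),
            "bytes": path.stat().st_size,
            "sha256": sha256_of(path),
            "source_url": source_url,                    # the page the file was taken from
            "downloaded_on": downloaded_on.isoformat(),  # e.g. "2016-10-11"
        })
    return entries


def write_manifest(entries: list[dict], target: Path) -> None:
    """Store the manifest as readable JSON (hypothetical layout)."""
    target.write_text(json.dumps(entries, indent=2), encoding="utf-8")
```

You would call `build_manifest` once per downloaded folder, passing the URL and download date given in the source list at the end of this article, and commit the resulting JSON file together with the data. If a file is ever changed or replaced, its checksum no longer matches the manifest.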
Choosing a Strategy
Your choice of strategy should, first and foremost, depend on your goals for the analysis; our goal has already been defined as answering the questions in Natura 2000 Part 2 – Asking the Right Question(s). Secondly, you should look at the complexity of the operations for getting the data and how frequently you do these operations. High frequency and a lot of complexity call for a high degree of automation. However, if it is something you do only once every now and then and the operations are fairly easy, spending a lot of time programming may not be worth the effort.
Other considerations are stability and size. Do you run the risk of the data being changed in some fundamental way, say in its underlying structure, or of the data becoming inaccessible? Or is the data a publicly available database that has been around for a long time and will continue to be accessible for the foreseeable future? Is the data small enough that downloading it and storing it locally is feasible?
So which strategy should we choose in the case of Natura 2000? We are looking at data on species in given habitats around Europe across several years. Let’s have a closer look at the data available to us, beginning with the most recent data (from 2015). From the “About this data” section, we find several important facts:
- Temporal frame: We have data from 2010 onwards. The data is sometimes published twice a year (middle and end of year), sometimes once a year (end of year).
- Changes: The data does not change once it is reported. The accessibility and location of the data would probably not change, although the latter is not guaranteed.
- Format: There are two different file formats: MS Access database files and/or CSV files.
- Size: 11 tables, some totaling over 300,000 entries. The total file size for each year is up to around 300 MB for the Access format and 150 MB for the CSV format. Single CSV tables are up to around 50 MB. Year-on-year data from 2010 to 2015 total 72 files and 2.5 GB of data.
- NULL-values: There are a lot of values that are NULL, for example, where nothing has been registered on a specific species at a specific site.
- Austria: Austria didn’t like the way the species were registered, so it does not register anything with Natura 2000.
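To get a feeling for how many NULL values a table contains, you can count them per column before deciding how to tidy the data. Below is a minimal Python sketch using only the standard library. It assumes that missing values appear either as empty cells or as the literal text NULL, which you should verify against the actual files. The column names in the usage example are made up for illustration.

```python
from __future__ import annotations

import csv
from collections import Counter
from typing import Iterable

# Assumption: a cell counts as missing if it is empty or holds the text "NULL".
MISSING_MARKERS = {"", "NULL"}


def count_missing(lines: Iterable[str]) -> tuple[int, Counter]:
    """Return (number of data rows, missing-value count per column).

    `lines` is any iterable of CSV text lines with a header row,
    for example an open file handle.
    """
    reader = csv.DictReader(lines)
    missing: Counter = Counter()
    rows = 0
    for row in reader:
        rows += 1
        for column, value in row.items():
            if column is None:
                continue  # surplus cells beyond the header are ignored
            if value is None or value.strip() in MISSING_MARKERS:
                missing[column] += 1
    return rows, missing


# Usage with a tiny in-memory table (hypothetical column names):
sample = [
    "site,species,population\n",
    "X1,Lutra lutra,NULL\n",
    "X2,NULL,\n",
]
row_count, missing_per_column = count_missing(sample)
for name, count in sorted(missing_per_column.items()):
    print(f"{name}: {count} of {row_count} values missing")
```

For a real table, pass an open file instead of the list, for example `open(path, newline="", encoding="utf-8")`. The encoding is also an assumption and may differ for the published files.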
We can draw the following conclusions from these facts:
- Download: As the data doesn’t change once registered and the file sizes are not too big, we can download the data and store it locally (and on GitHub).
- Year-on-year only: Since mid-year data is published only in some years, we choose to compare end-of-year data only, as it is reported every year.
- Format: We have two choices: either go with the Access format through the whole analysis, or convert the first year to CSV tables. We will have a closer look at this in Part 4.
- Size of tidy tables: The tidy data will clearly be much smaller than the original data sets, as there are a lot of empty values.
- Austria: We cannot include values from Austria.
The next step in the project is to get hold of all the data. I have downloaded all the ZIP-files and expanded these into local folders. If you look at the files, you will see that the Access-files have different names depending on the year, while all CSV-files have identical filenames from one year to another. To keep files separate and ordered, I made a folder called “data-original” and subfolders for each year.
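I expanded the archives by hand, but the same folder layout can also be produced by a short script, which makes the step repeatable. The following Python sketch assumes the ZIP files have already been downloaded and that you know which archive belongs to which year. The function name and the mapping from year to archive are illustrative assumptions.

```python
from __future__ import annotations

import zipfile
from pathlib import Path


def extract_by_year(archives: dict[int, Path], target_root: Path) -> dict[int, list[str]]:
    """Unpack each year's archive into target_root/<year>/.

    Returns the names of the extracted files per year, so the result
    can be logged or compared between years.
    """
    extracted: dict[int, list[str]] = {}
    for year, archive in sorted(archives.items()):
        year_dir = target_root / str(year)
        year_dir.mkdir(parents=True, exist_ok=True)
        with zipfile.ZipFile(archive) as bundle:
            bundle.extractall(year_dir)
            extracted[year] = sorted(
                info.filename for info in bundle.infolist() if not info.is_dir()
            )
    return extracted
```

A call could look like `extract_by_year({2015: Path("downloads/2015.zip")}, Path("data-original"))`, where the download path is hypothetical. Because every year gets its own subfolder, the identically named CSV files do not overwrite each other.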
Working with Git and GitHub
I use Git and GitHub to keep track of changes in the project. I have found Git to be extremely useful in data science projects, among other things because it is easy to roll changes back and forth, and it is easy to share the project with others. The project is available here.
The data has been downloaded from the Natura 2000 page and simply added to the local Git directory. Afterwards, I staged all the files and committed the changes. If these are terms you don’t recognize, have a look at this introduction to Git. If you have used Git before, but don’t remember the commands, the Git reference might be useful.
Note: The Access files are so large that you need a Git extension to get them to work with GitHub. The normal maximum file size on GitHub is 100 MB, and you get an error message if you try to push files that are bigger than that. However, the extension Git Large File Storage (LFS) is a way around that limitation. It is really easy to install and use. Just follow the instructions on the Git Large File Storage (LFS) page.
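Before pushing, it can help to know which files are actually over the limit and therefore need to be tracked with LFS. This small Python sketch lists them. It assumes the limit is 100 × 1024 × 1024 bytes per file and skips the `.git` folder; the function name is my own.

```python
from __future__ import annotations

from pathlib import Path

# Assumption: GitHub's per-file limit, taken as 100 * 1024 * 1024 bytes.
GITHUB_FILE_LIMIT_BYTES = 100 * 1024 * 1024


def files_over_limit(root: Path, limit: int = GITHUB_FILE_LIMIT_BYTES) -> list[tuple[str, int]]:
    """Return (relative path, size in bytes) for every file larger than `limit`."""
    oversized = []
    for path in sorted(root.rglob("*")):
        relative = path.relative_to(root)
        if ".git" in relative.parts:
            continue  # Git's own bookkeeping is not part of the project data
        if path.is_file():
            size = path.stat().st_size
            if size > limit:
                oversized.append((relative.as_posix(), size))
    return oversized
```

Running `files_over_limit(Path("data-original"))` returns the candidates for LFS tracking. An empty list means every file is below the assumed limit.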
Next: Cleaning Data (coming soon).
 Natura2000 data – Values for 2010 – http://www.eea.europa.eu/data-and-maps/data/natura-1 – Downloaded on October 11, 2016.
 Natura2000 data – Values for 2011 – http://www.eea.europa.eu/data-and-maps/data/natura-2 – Downloaded on October 11, 2016.
 Natura2000 data – Values for 2012 – http://www.eea.europa.eu/data-and-maps/data/natura-3 – Downloaded on October 11, 2016.
 Natura2000 data – Values for 2013 – http://www.eea.europa.eu/data-and-maps/data/natura-5 – Downloaded on October 11, 2016.
 Natura2000 data – Values for 2014 – http://www.eea.europa.eu/data-and-maps/data/natura-6 – Downloaded on October 11, 2016.
 Natura2000 data – Values for 2015 – http://www.eea.europa.eu/data-and-maps/data/natura-7 – Downloaded on October 11, 2016.
 Natura2000 data – Values for 2016 – https://www.eea.europa.eu/data-and-maps/data/natura-8 – Downloaded on November 9, 2017.