Automating EUROSTAT in Stata – Part 2: Formatting data files

Once the Eurostat files have been downloaded using the procedures explained in the earlier post, the next step is to clean up the files in a two stage process to fix the following two issues:

  1. Names of variables.
  2. Getting rid of text fields in numeric fields.

The Stata syntax is as follows:

// reading the file

insheet using ./raw/nama_gdp_c.tsv, tab clear

// cleaning the first column

split v1, p(,) gen(a)
drop v1
order a*
// cleaing the data variables
ds v*
di “`r(varlist)'”

local vlist `r(varlist)’

// removing all non-numeric characters
foreach i of local vlist {
     cap replace `i’ = regexr(`i’, “:”, “”)
     cap replace `i’ = regexr(`i’, “[a-z]”, “”)

// for specific files with extra characters. This list can be expanded to include any other type of non-numeric characters      
     cap replace `i’ = regexr(`i’, “e”, “”)
     cap replace `i’ = regexr(`i’, “p”, “”)
     cap replace `i’ = regexr(`i’, “r”, “”)


// fixing a bad header and the “a” columns

order a* v*

ds a*
di “`r(varlist)'”
local alist `r(varlist)’

foreach j of local alist {
     replace `j’=”geo” if `j’==”geo\time”
     local header = `j'[1]
     ren `j’ `header’
// fixing the v columns
ds v*
di “`r(varlist)'”
local vlist `r(varlist)’

foreach k of local vlist {
     local header = `k'[1]
     ren `k’ y`header’
// check headers with the first row and then drop the first row
drop in 1

destring _all, replace


save ./split/nama_gdp_c.dta, replace    // save the files in the split directory

The whole process can be looped over multiple files by defining a macro for various files.

Other posts in this series:
Part 3: Reshaping files
Part 1: Downloading files

2 thoughts on “Automating EUROSTAT in Stata – Part 2: Formatting data files

  1. Hi,
    I am trying to run your code but I am having an issue when removing the non numeric chars from v2 to v8.
    Namely nothing happens while the code still runs successfully. I didn’t modify the code but the observation with : or chars do not change… any hint? (All dataset is currently looking as a string)

    Many thanks

    1. Hi Mik,

      You try the following three steps:
      1. Make sure you replace the quotation marks “” in the dofile if you have copy pasted the code from above. Some web fonts have different symbols for the same character.

      2. Make sure all the code is running in one go especially if you have locals defined. Otherwise the code still runs but nothing happens.

      3. Do a quick eyeball to see if there are additional non-numeric characters. A quick way of doing this is copying a string variable and force destringing it. Here is an example code:
      gen temp = strvariable
      destring temp, force
      br strvariable if temp==. & strvariable!=””

      If it doesn’t solve the issue, you can mail your code and i can check it.



Leave a Reply to Mik Cancel reply

Your email address will not be published.