Historically, CEEMID started out as the Central and Eastern European Music Industry Databases, created out of necessity following a CISAC Good Governance Seminar for European Societies in 2013. The adoption of European single market and copyright rules, and the increased activity of competition authorities and regulators, required a more structured approach to setting collective royalty and compensation tariffs in a region that was traditionally regarded as data-poor, with fewer industry and government data sources available.

In 2014 three societies, Artisjus, HDS and SOZA, realized the need to make further efforts to modernize the way they measure their own economic impact and the economic value of their licenses, in order to remain competitive in advocating their interests vis-à-vis domestic governments, international organizations such as CISAC and GESAC, and the European Union. They signed a Memorandum of Understanding with their consultant to set up the CEEMID databases and to harmonize their efforts. Among the results of this cooperation:

  • The first Hungarian Music Industry Report, a 144-page business strategy and policy advocacy report, which became the basis of annual reports on the Hungarian music industry.

  • The first Slovak Music Industry Report, a 227-page advocacy report with business strategy and evidence-based policy recommendations. Several royalty pricing studies and other fact-based industry analyses were also commissioned by Slovak stakeholders; these are not publicly available.

  • Private Copying in Croatia, an advocacy report for re-setting private copying remuneration and for measuring the value transfer to media platforms such as YouTube. In Hungary, more technical and detailed reports were prepared for Artisjus, Mahasz, EJI, Hungart and Filmjus; these are not available to the public.

  • CEEMID was used in various quantitative ex ante grant assessments, in royalty price setting, in calculating private copying remuneration, in predicting audiences, and in other evidence-based policy projects.

  • See the CEEMID Documentation Wiki for more information about data coverage and methodology.


Reprex B.V. is a reproducible research company that aims to put CEEMID’s intellectual property (its data maps, open data assets, research automation, and its 2,000 cultural and creative sector indicators) on a sustainable business model. We believe that CEEMID created a globally unique data program, but one with too few users and with ad hoc, scarce (private) funding that made a great product financially infeasible.

We believe that whenever a business or policy consulting team, a research institute, or a data journalism team has already used, formatted, and analyzed data from an external source at least twice, the procedure should be automated. Automation makes it error-free, well documented, cheap and re-usable. Furthermore, making data collection continuous instead of ad hoc saves data acquisition, validation and supervision costs. We would like to help medium-sized business, policy, NGO, scientific and data journalism organizations that do not have the institutional capacity to hire data scientists and engineers.

We made critical elements of our software product available as peer-reviewed, open source statistical software, and on the basis of these elements we created a minimum viable product.

Our minimum viable product, offered in two forms (data-as-a-service and solution-as-a-service), was presented to three customer segments with similar research problems: business and policy consultancies, university research institutes, and data journalism teams. We have not received a single refusal; we already have contractual or letter-of-intent commitments, and several of our prospects immediately referred us further.

They have committed, in the form of contracts or letters of intent, to incorporate our product into planned research activities and some subscription products, and some of them have already assigned budgets and resources to these projects. While we see, based on our team’s experience in these segments, that the problems and workflows we support are very similar, the business and funding models of the three segments differ substantially. We create value by continuous automation, which implies a different cost structure than our customers’ project- or grant-based ad hoc funding, and we are working on a cooperation model that bridges these differences to exploit the highest value proposition.

Data sources

This data catalogue grew out of a collaborative observatory, CEEMID. CEEMID aims to transfer thousands of indicators, and the verifiable, open-source software that creates them, to the European Music Observatory, in order to give Europe-wide access to timely, reliable, actionable statistics and indicators for the music industry, policymakers and music professionals. (Read more about our data coverage.)

Reprex aims to support this transition and, at the same time, to create new data products for other creative industries. See our call for partners.

Shared data resources

Other data resources

Data integration

For practical reasons, CEEMID switched from the originally envisioned, centralized, permission-based data structure to a more flexible, decentralized approach. This approach is based on continuous data integration, which requires permission to use business-confidential information only at the point of use. This allowed a rapid extension of CEEMID to the whole of Europe and even beyond. As a result of continuous data integration, it already includes hundreds of the indicators foreseen in all pillars of the planned European Music Observatory.

While CEEMID is aware of and uses the metadata of CISAC’s, IFPI’s, EAO’s and other industry sources’ data, it does not contain these data; they are only integrated when a user with permission to use these industry sources requires their combination with other CEEMID data or user-specific data. While this approach makes sharing results more cumbersome, it provided a path to increase the number of useful indicators from a few dozen to around a thousand. Furthermore, it greatly increases the value of CISAC’s, IFPI’s or EAO’s data, especially when designing better royalty rates or creating economic evidence for litigation. Take a look at a simple, non-confidential example blog post.

We integrate data from the following sources into open data products and music industry intelligence apps:

  • Nationally representative cultural access and participation (CAP) surveys of music users and film viewers.

  • Anonymous CEEMID Music Professional Surveys and CEEMID Audiovisual Professional Surveys about professionals’ work, incomes and costs. See an example blog post.

  • Big data sources from various geolocational applications about events and location visits.

  • Automatic data retrieval from open data sources, including statistical data and EU-funded research. See an example blog post.


Our reproducible research workflow is based on the statistical programming language R (R Core Team 2020). R is the open source version of the statistical programming language S, and it is widely used in national statistical offices (Templ and Todorov 2016); we believe it is the 21st-century lingua franca of statistics. It is a very high-level, interpreted language that is easy to use and modify, and even single lines of code can be executed. In other words, it is well suited for literate programming, i.e. human-readable program code that supports peer review.

As members of the rOpenGov initiative, we actively contribute to and create various open-source “packages”, or software libraries, that allow reproducible access to open data. The eurostat package (Lahti et al. 2020, 2017) provides API access to the Eurostat data warehouse and basic processing of its data. Because of the problems of Eurostat’s regional statistics, we amended it with further software that became the regions package (Antal 2020b). For the use of symmetric input-output tables in economic impact assessments we created the iotables package (Antal 2020a), because these data resources cannot be used without further processing from the Eurostat warehouse.
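As a sketch of what reproducible data access looks like in practice, the snippet below downloads one Eurostat product with the eurostat package. The data code used here is an illustrative choice, and the call needs internet access:

```r
library(eurostat)

# get_eurostat() downloads a product by its online data code and returns
# it in long (tidy) format; time_format = "num" gives numeric years.
# The data code below (GDP and main aggregates) is an illustrative choice.
gdp <- get_eurostat("nama_10_gdp", time_format = "num")

head(gdp)
```

Because the same call always yields the table in the same standardised form, any script built on top of it can be re-run unchanged when Eurostat publishes a new edition of the data.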

In order to create data products that can easily be used in any spreadsheet application or statistical software on a personal computer, or inserted into a relational database, our data must comply with the statistical definition of tidy data (Wickham 2014). Our indicators usually go through the tidyverse packages dplyr: A Grammar of Data Manipulation and tidyr: Tidy Messy Data (Wickham et al. 2020; Wickham and Henry 2020) and the accompanying purrr: Functional Programming Tools (Henry and Wickham 2020). For the analyst, this brings R very close to SQL, to the point that you can write mixed R/SQL scripts.
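A minimal illustration of the tidy data idea (the numbers are made up): a wide table with one column per year is reshaped with tidyr so that every row is one observation, which is the form a spreadsheet, an SQL database or a statistical model expects.

```r
library(dplyr)
library(tidyr)

# A made-up "wide" table: one column per year is compact to read,
# but hard to filter, join or model.
messy <- tibble(
  country = c("HU", "SK", "HR"),
  `2019`  = c(100, 60, 40),
  `2020`  = c(130, 75, 55)
)

# pivot_longer() turns it into tidy form: one row per country-year
# observation, one column per variable.
tidy <- messy %>%
  pivot_longer(cols = c(`2019`, `2020`),
               names_to = "year", values_to = "value") %>%
  mutate(year = as.integer(year))

tidy
```

In tidy form the table can be filtered, joined and grouped with the same verbs (`filter()`, `left_join()`, `group_by()`) regardless of how many years the next data edition contains.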

For the reproducible creation of this data catalogue we use bookdown (Xie 2020a), a book-format dynamic reporting tool based on knitr (Xie 2020b, 2015). They in turn build on rmarkdown (Allaire et al. 2020; Xie, Allaire, and Grolemund 2018), which allows the combination of marked-up text for various outputs, such as PDF, HTML, e-books and Word documents, with program code in the R, Python and C++ languages. We also welcome contributions in R, Python or C++.

The use of open source software and the open source R statistical language allows a continuous peer-review of data ingestion, processing, corrections and indicator creation by statisticians, data scientists and academics. For example, this allowed us to compare test results on calculating economic impact indicators for the creative industries and other industries with the UK statistical office.

Data processing

  • iotables is a reproducible research tool that can work with national accounts and create some satellite accounts for all EU member states. It was originally developed to calculate the economic impact of the Hungarian tax shelter before its renewal (state aid notification at DG Competition) and for the Slovak Music Industry Report, which used a similar methodology to show that CCS sectors are overtaxed in the country. The iotables open source statistical software library is used by about 800 practitioners worldwide.

  • regions solves the problem that Europe’s regional boundaries have changed in several thousand places over the last 20 years, and therefore member states’ and Eurostat’s regional statistics are not comparable for more than 2-3 years. The software validates and, where possible, changes the regional coding from NUTS1999 up to the not yet used NUTS2021. It was originally designed in a research project at IViR at the University of Amsterdam to understand the geographical dynamics of book piracy. Because of the need this software fills, it had 700 users in the first month after publication.

  • retroharmonize is a software package that allows the programmatic retrospective harmonization of surveys, such as the last 35 years of Eurobarometer microdata or all Afrobarometer microdata. retroharmonize grew out of certain CEE member states’ need for comparable data about their music and audiovisual sectors: we commissioned surveys following ESSNet-Culture guidelines and joined their data with open-access European microdata-level surveys.
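As a hedged sketch of what the regions package does (we assume the `recode_nuts()` interface of the released package; the exact output columns may differ by version), regional codes of mixed vintages can be recoded to a single NUTS boundary definition:

```r
library(regions)

# A toy data frame with regional codes from different NUTS vintages
# in one column; recode_nuts() maps them, where a correspondence
# exists, to the requested boundary definition (here NUTS2016).
dat <- data.frame(
  geo    = c("FR7", "HU10", "SK01"),
  values = c(1, 2, 3)
)

recoded <- recode_nuts(dat, geo_var = "geo", nuts_year = 2016)
recoded
```

After recoding, statistics published under different boundary definitions can be joined on one common regional code, which is exactly what multi-year regional comparisons require.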

These tools, and other similar tools that Daniel develops, allow us to test ideas and document them. They follow the concepts of open collaboration (with other statistical, academic and national member state data sources) and reproducible research.

Survey harmonization

The Final Report of the Working Group European Statistical System Network on Culture (in short: ESSnet-Culture) (Bína, Vladimir et al. 2012) contains a rather detailed guideline in the report of the Task Force on Cultural Practices and Social Aspects of Culture: a mature social scientific model to measure participation, together with survey methodology and sample descriptions of how to carry out such surveys (in general: pp. 236-242; survey methodology and harmonization: pp. 242-255; recommendations: pp. 273-274; including an extensive annex with examples on how to use the question hierarchy on pp. 397-417).

The top-level, basic questions were standardized by the ESSnet-Culture working group. They are based on the ICET surveying model, which in turn has a history in the quantitative surveying of entertainment industry audiences since the 1970s. The ICET model itself was first designed in the Netherlands about 20 years ago for a better measurement of the then increasingly digital forms of cultural participation (Haan and Adolfsen 2008; Haan and Broek 2012).

Various models

For modelling we use the tidymodels concept, which makes hundreds of R analytical libraries, and increasingly Python libraries, available via a unified API. Tidymodels allows us to pre-run tens of thousands of model variants for the client and to pre-select the most promising analytical tools for their problem.

Tidymodels is itself in an early stage of development. Learning tidymodels has enormous benefits for our analytical work, and if you want to be involved in our econometric and machine learning services, you have to use it.

It does not matter which packages and classes you use for the final models. If you are comfortable with data.table, you will use data.table. The tidyverse and tidymodels play a role in pre-processing and processing various data resources for analysis.
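The unified API can be sketched on a built-in dataset: the same `fit()`/`predict()` verbs below would stay unchanged if the `"lm"` engine were swapped for, say, a regularized regression engine (this is a minimal sketch, not our production modelling code).

```r
library(tidymodels)

# Specify a model type and a computational engine, then fit it
# with a formula on R's built-in mtcars data.
spec  <- linear_reg() %>% set_engine("lm")
model <- fit(spec, mpg ~ wt + hp, data = mtcars)

# predict() returns a tidy tibble with a .pred column,
# one row per row of new_data.
predictions <- predict(model, new_data = mtcars)
head(predictions)
```

Because every engine is driven through the same verbs, many candidate models can be specified, fitted and compared in a loop, which is what makes pre-running large numbers of model variants practical.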

How does reproducible research create value?

  • It improves work habits, and enhances the efficiency of analysts.

  • It strengthens teamwork, and makes the integration and replacement of team members far easier.

  • It avoids duplication and multiplication of efforts, and dramatically reduces the time spent on data manipulation, debugging, error searching and formatting. This can save up to 80% of working time in analyst and consultant roles.

  • It relieves senior staff, because oversight is far more efficient when most errors are captured automatically.

  • It is better suited to the cumulative growth of data, information and knowledge. In the medium run it reduces data and information costs significantly, and in the long run it produces a very strong competitive edge.

  • Replication, reproducibility and the higher standards of confirmability and auditability are not only scientific standards; they are often required by market regulators, professional standards and internal working guidelines.

  • It gives access to the growing body of open data in the EU (such as survey data, the raw data used to calculate inflation, etc.), which is free or almost free as raw data but carries large processing costs, as it is offered by public bodies on an as-is basis. This can replace costlier and often less valuable data acquisitions.

“Functions [in programming] allow you to automate common tasks in a more powerful and general way than copy-and-pasting. Writing a function has three big advantages over using copy-and-paste […] You should consider writing a function whenever you’ve copied and pasted a block of code more than twice (i.e. you now have three copies of the same code).” (Hadley Wickham in R for Data Science)

We want to take this idea much further. We believe that any data table from Eurostat, the IMF, various industry sources or APIs that has been downloaded or acquired at least twice should arrive in the organization via a data ingestion application that automatically acquires every new edition of the data asset. Instead of re-formatting and adjusting for units, currencies and missing values each time, the application should always present the data asset in its best available form.

We believe that every table, visualization and supervised model (a model that is not created by a machine but by your analysts) that has been produced at least twice should be produced by an application that regenerates all related tables, visualizations and model results at any change of the data.
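This rule of thumb can be sketched in a few lines of R (the helper name and the cleaning steps here are our own illustration, not a specific CEEMID tool): the steps that would otherwise be copy-pasted for every new edition of a table are wrapped once, so that each edition arrives in the same, best available form.

```r
# Illustrative helper (our own naming): wrap the repeated cleaning
# steps once, instead of redoing them by hand for each new edition.
prepare_indicator <- function(raw, value_col = "values") {
  # coerce the value column to numeric; non-numeric markers become NA
  raw[[value_col]] <- suppressWarnings(as.numeric(raw[[value_col]]))
  # drop rows where no numeric value could be recovered
  raw[!is.na(raw[[value_col]]), , drop = FALSE]
}

# A new "edition" of a raw table, with a non-numeric missing-value marker:
raw_edition <- data.frame(geo    = c("HU", "SK", "XX"),
                          values = c("10", "12", "n/a"))

clean <- prepare_indicator(raw_edition)
clean
```

Once such a function exists, pointing it at every future edition of the source table is a one-line call, and any fix to the cleaning logic propagates to all downstream tables, visualizations and models.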

Other observatories

Observatories are created to permanently collect data and information and to create a knowledge base for research and development, science, and evidence-based policymaking, usually by consortia of business, NGO, scientific and public bodies.

We aim to create similar observatories, but we are in no way affiliated or connected to the following existing observatories, which we see as role models. Our mission is to serve similar observatories with research automation, making the observatory’s services less costly and more timely, with a higher level of quality control.


Allaire, JJ, Yihui Xie, Jonathan McPherson, Javier Luraschi, Kevin Ushey, Aron Atkins, Hadley Wickham, Joe Cheng, Winston Chang, and Richard Iannone. 2020. Rmarkdown: Dynamic Documents for R.

Antal, Daniel. 2020a. iotables: Importing and Manipulating Symmetric Input-Output Tables.

Antal, Daniel. 2020b. Regions: Processing Regional Statistics.

Bína, Vladimir, Chantepie, Philippe, Deboin, Valérie, Kommel, Kutt, Kotynek, Josef, and Robin, Philippe. 2012. “ESSnet-CULTURE, European Statistical System Network on Culture. Final Report.” Edited by Frank, Guy.

Haan, Jos de, and Anna Adolfsen. 2008. De Virtuele Cultuurbezoeker - Publieke Belangstelling Voor Cultuurwebsites. SCP-Publicatie 2008/9. Den Haag, the Netherlands: Sociaal en Cultureel Planbureau.

Haan, Jos de, and Andries van den Broek. 2012. “Nowadays Cultural Participation - an Update of What to Look for and Where to Look for It.” In ESSnet-CULTURE, European Statistical System Network on Culture. Final Report., 397–417. Luxembourg.

Henry, Lionel, and Hadley Wickham. 2020. Purrr: Functional Programming Tools.

Lahti, Leo, Janne Huovari, Markus Kainu, and Przemyslaw Biecek. 2017. “Eurostat R Package.” R Journal.

Lahti, Leo, Janne Huovari, Markus Kainu, and Przemyslaw Biecek. 2020. Eurostat: Tools for Eurostat Open Data.

R Core Team. 2020. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing.

Templ, Matthias, and Valentin Todorov. 2016. “The Software Environment R for Official Statistics and Survey Methodology.” Austrian Journal of Statistics 45 (March): 97–124.

Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (1): 1–23.

Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2020. Dplyr: A Grammar of Data Manipulation.

Wickham, Hadley, and Lionel Henry. 2020. Tidyr: Tidy Messy Data.

Xie, Yihui. 2015. Dynamic Documents with R and Knitr. 2nd ed. Boca Raton, Florida: Chapman; Hall/CRC.

Xie, Yihui. 2020a. Bookdown: Authoring Books and Technical Documents with R Markdown.

Xie, Yihui. 2020b. Knitr: A General-Purpose Package for Dynamic Report Generation in R.

Xie, Yihui, J. J. Allaire, and Garrett Grolemund. 2018. R Markdown: The Definitive Guide. Boca Raton, Florida: Chapman; Hall/CRC.