Reading time 17 minutes

Medical espionage

13 Feb 2021

The prospect of genomic medical espionage

In the heyday of human source intelligence - think late Cold War - international espionage was relatively easy to understand. It consisted of finding people willing to gather information from the enemy and deliver it to you. The recruited individuals could be sympathetic to your cause and volunteer, or they could be coerced and remain secret for fear of punishment by the enemy. Any reader of le Carré will immediately envision the tangled web that this can ultimately produce: running agents, dead drops, double agents providing chicken-feed information to handlers to gain their trust! All very dramatic and palpable. The other disciplines of intelligence collection may be less tangible without experience: geospatial, measurement and signature, open-source, signals, technical, cyber, or financial. The information gathered can be powerful, but each is a “one-hit”. If the target changes, the old data may no longer be useful. Now think of genetic data. Once gathered, it never changes. It only becomes more powerful when metadata is added.

Consider another distinction from the era of Cold War espionage. It was essentially one side versus the other. In the year 2021, most nations don’t have a direct enemy. Therefore, today it is difficult to even summarise data collection in terms of “from whom” or “against what”. Between 2013 and today, the revelation of global communications interception has shown that the easiest approach for intelligence data collection is just to gather everything and then figure out who the enemy is later. I don’t have a clear example of how this relates to medical or genomic data, but I do keep in mind that the data holder can decide at any time in the future what genotypes they determine are of interest. Policy makers will be very slow to recognise that an a priori usage plan is paramount to genomic data protection.

I heard about a meeting between two national intelligence groups (call them nation A and nation B). Nation A wanted to discuss corporate espionage. Nation B put it off as long as possible until finally explaining that in nation A, yes, national espionage is commendable for protecting the country while corporate espionage is frowned upon; for nation B, however, any kind of intelligence that benefits the country is heroic. Nation B freely acknowledges that anything worth stealing from A will be stolen. One cannot complain about the rules defined in the game, but those rules should not be forgotten just because nation A does not recognise them as readily. We should keep in mind that our genomic data handlers may unexpectedly decide to share the data with others outside of our control. GDPR won’t be of much use. This anecdote illustrates a pertinent fact for human genomic medicine. Unlike most personal data (email address, phone number, physical address, interests, IMEI), you only have one genome sequence. Once an online medical database gets leaked, you cannot request a new genome sequence via email. Once it is out, it is out.

In a lecture, Richard George, NSA technical director of the Information Assurance Directorate, stated that credit card data is worth very little today on “the market”; health information is the new target of interest, because with that information comes an individual’s identity and the potential to order drugs in their name. With a little more imagination, one can envision the potential for genomic data. Some of the most powerful analysis techniques are still being developed today, and we may see a big leap in genomic interpretation in the next decade. Once the analysis protocols are complete, whoever has the biggest database will be the most powerful (and for entrepreneurs, the most profitable, if their focus is on the correct questions in human health, etc.).

Genomic data is most readily applicable to health. However, it could stretch much further. Everyone knows about Google and Facebook - our daily activities categorise us into advertisement groups: male age 25-30, interests, dislikes, how much we are willing to spend, etc. It boils down to “how much advertising cost is required to sell one unit of product?”. In genomics and health this equates to effect size - what are the odds of disease given the genotype? How about “given the genotype, how risk-averse are we, how impulsive, will we travel abroad or should we see ads for local entertainment”? With greater and greater complexity, subtlety, and granularity, this may be possible as long as we can quantify the heritability of complex traits at very low effect sizes. Advertising and insurance risk calculations depend on relatively simple statistical formulas that just need enough data to remain profitable. It is almost certain that much more esoteric applications are on the way. It is extremely unlikely that legal protections on genomic data will arrive before the open market dictates the trajectory.
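The effect-size question above reduces to simple arithmetic. A minimal sketch of an odds-ratio calculation, with all counts invented purely for illustration:

```python
# Toy 2x2 table of disease status vs. carrier status for a genotype.
# All counts are invented purely for illustration.
carriers_with_disease = 60
carriers_without_disease = 940
noncarriers_with_disease = 30
noncarriers_without_disease = 970

# Odds of disease within each group.
odds_carriers = carriers_with_disease / carriers_without_disease
odds_noncarriers = noncarriers_with_disease / noncarriers_without_disease

# Odds ratio: how much carrying the genotype shifts the odds of disease.
odds_ratio = odds_carriers / odds_noncarriers
print(f"odds ratio: {odds_ratio:.2f}")  # ~2.06 for these toy counts
```

The formula is trivial; the power lies entirely in who holds enough genotype-phenotype pairs to fill in the table for millions of traits.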

Will there even be a need for medical espionage when Illumina, Google, BGI, etc. start offering free genome sequencing? Will it be good or bad once every child gets a genome sequence along with their birth certificate? Predicting the balance of world order is folly, but as they say, “history repeats itself”. We can detach from our virtual identities - online presence, daily routines - but we cannot detach from our genetic identity. Unlike a fingerprint or a retinal scan, are we prepared to provide a biometric identity that carries so much information?


There have been several ransomware attacks on the health industry in recent years. These have included public and private research and innovation institutions. However, the worst examples of this type of crime were seen during the 2017 WannaCry attack:

“One of the largest agencies struck by the attack was the National Health Service hospitals in England and Scotland, and up to 70,000 devices – including computers, MRI scanners, blood-storage refrigerators and theatre equipment – may have been affected.” [1].

Legal safeguard

With the major risks to life caused by attacks on medical institutions, the COVID-19 crisis has prompted clear messages via the Oxford Statement on the International Law Protections Against Cyber Operations Targeting the Health Care Sector and a second statement on Safeguarding Vaccine Research, published in May and August 2020, respectively [2, 3].

International humanitarian law requires that medical care is respected and protected. COVID-19 illustrates that primary research is just as critical and should have the same protections. In general, publicly funded research should be open and freely accessible to all (while respecting the privacy of human health data and personal data). However, the long and complex process of primary research means that publication or open-sourcing can take a long time. Furthermore, the researchers depend on recognition of their work and are unlikely to publish intermediate results.


It is understandable that nations more interested in succeeding privately will be interested in stealing any information available. More likely still, this could come from private companies willing to steal intellectual property (IP). For a specific pathogen like SARS-CoV-2, just knowing which amino acid residues your competitor is most interested in can give you immediate insight that might otherwise have taken months to produce. Research project datasets tend to start out broad and move linearly towards a final result. If the actual documentation and code can be read, then these critical results will be obvious. However, even metadata like filenames can provide the key information. It is not unreasonable to assume that researchers will simply name datasets incrementally with the key process used. As an example, in a database you might see files with ascending date stamps:

  • data_group1.csv
  • data_group1_pruned.csv
  • data_group1_pruned_significant_hits.csv
  • data_group1_pruned_significant_hits_pR127L.csv
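Mining such filenames needs no access to the data itself. A toy sketch, assuming the naming convention shown above (underscore-separated processing steps, where the longest name marks the deepest pipeline stage):

```python
# The filenames from the example above; the parsing convention is an
# assumption for illustration, not a universal standard.
filenames = [
    "data_group1.csv",
    "data_group1_pruned.csv",
    "data_group1_pruned_significant_hits.csv",
    "data_group1_pruned_significant_hits_pR127L.csv",
]

# The longest name has accumulated the most processing steps; its final
# token is the most specific piece of metadata the researchers leaked.
most_processed = max(filenames, key=len)
focus = most_processed.removesuffix(".csv").split("_")[-1]
print(focus)  # pR127L
```

A directory listing, in other words, can hand a competitor the headline result months before publication.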

Anyone working on the same problem will understand the routine protocols and know that focusing their research on amino acid p.R127L will give them an advantage. This problem is one for IP law. If it were something like vaccine research then one might argue about applied ethics - “is it wrong to steal that which should be free information?” - but that weak argument is quickly disarmed by the fact that we would want our vaccine to come from the primary researcher, not from a thief who is willing to cut corners.

It is obvious that protection should be implemented to prevent theft of public IP. Furthermore, publicly funded health research results usually reside alongside private health information that deserves to have strong protections.


Disinformation, dezinformatsiya, includes the leaking of information that seems valuable but is either a dead end or, worse, intentionally harmful. It is critical to ban the use of disinformation in any research affecting human health or in publicly funded research. It would be better to tolerate instances of genuine IP theft than to risk any harm.

Data pollution

Conversely, data pollution is another potential risk. Large-scale genomics relies on careful curation. Importing incorrect data will pollute analyses and potentially mask true positive results. In the last few years, some commercial genomic services have allowed users to upload their own genomic and personal phenotypic information. While most users are just interested in their own results, this carries a reasonably large risk: a targeted submission of records with randomly shuffled phenotypic information would weaken the database for association analysis.
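The dilution effect can be sketched in a few lines. A toy simulation (all frequencies and cohort sizes are invented): genuine records show a risk allele enriched in cases, while fake records with shuffled phenotype labels scatter carriers evenly across both groups and pull the observed odds ratio back towards 1.

```python
import random

random.seed(42)  # deterministic toy simulation

N = 5000  # genuine participants per group (invented)

# Genuine cohort: the risk allele is carried by 30% of cases, 20% of controls.
case_carriers = sum(random.random() < 0.30 for _ in range(N))
control_carriers = sum(random.random() < 0.20 for _ in range(N))

def odds_ratio(case_car, case_n, ctrl_car, ctrl_n):
    """Odds ratio for carrier status between cases and controls."""
    return (case_car / (case_n - case_car)) / (ctrl_car / (ctrl_n - ctrl_car))

clean_or = odds_ratio(case_carriers, N, control_carriers, N)

# Pollution: N fake records whose phenotype labels were randomly shuffled,
# so carriers (25% overall) land evenly in the "case" and "control" groups.
fake_case_carriers = sum(random.random() < 0.25 for _ in range(N // 2))
fake_ctrl_carriers = sum(random.random() < 0.25 for _ in range(N // 2))

polluted_or = odds_ratio(case_carriers + fake_case_carriers, N + N // 2,
                         control_carriers + fake_ctrl_carriers, N + N // 2)

print(f"clean OR:    {clean_or:.2f}")     # around 1.7 for these parameters
print(f"polluted OR: {polluted_or:.2f}")  # attenuated towards 1.0
```

Scale the fake submissions up and the signal eventually drops below any significance threshold, which is the attack the paragraph above describes.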

In the media

In this section I use some examples from popular media. I want to acknowledge that popular news stories for audiences in UK/USA often use scaremongering tropes of East versus West and often attribute individual responsibility to nations - China, N Korea, Russia, etc. Such references are used here only as media examples. In the classic egregious intelligence-community style, the most flamboyant reports include the least tangible evidence.

I notice that reports of medical espionage in public media are not always accurately framed. The news story of a former University of Florida researcher indicted for a scheme to defraud has, in other places, been framed more like someone working under cover for China rather than the more accurate description of someone committing fraud by failing to report overseas funding sources [4]. There are examples of stolen research IP for personal gain, such as the “hospital researcher sentenced to prison for conspiring to steal trade secrets, sell them in China” [5]. In this case, after ten years in the field the researcher was accused of “stealing exosome-related trade secrets concerning the research, identification and treatment of a range of pediatric medical conditions” and then “creating and selling exosome isolation kits” via her company in China [5].

Unlike these examples of personal gain, there have been reports of national medical espionage during the COVID-19 crisis.

The BBC reported in November 2020 [6] that

“Microsoft said at least nine health organisations including Pfizer had been targeted by state-backed organisations in North Korea and Russia”.

In February 2021, a claim without a cited source by Yonhap news [7], repeated by the BBC [8], stated that the South Korean

“National Intelligence Service (NIS) unveiled information during a closed-door session of the National Assembly’s intelligence committee stating that North Korea has attempted to hack the servers of a local drug manufacturer to obtain technology information on the company’s coronavirus vaccine and treatment.”

Several similar stories can be read in the references [9-11]. Reports like this rarely include published evidence and may be nothing but fantasy dreamed up by The Tailor of Panama type reporters.

However, targeted theft is very likely, and one would assume it to be happening even when specific reports are unsubstantiated. No matter whom the media casts as the national enemy this year, valuable genomic data is at risk. Researchers and commercial providers should think about mitigating this risk, not by implementing heavier security but by making the data public (safely). Their value can be generated via public IP in software as a service (SaaS) rather than by hoarding sensitive data.

Protecting data and promoting open-source access

Projects focused on safe access

The Global Alliance for Genomics and Health (GA4GH) is leading the effort in safe data access.

“GA4GH is a policy-framing and technical standards-setting organization, seeking to enable responsible genomic data sharing within a human rights framework. Enabling responsible genomic data sharing for the benefit of human health.”

The driver projects promoted by GA4GH are some of the best real-world examples today.

All of these projects play an extremely important role in human health and research. There are also many other similar initiatives outside of GA4GH which are promoting scientific collaboration.

Obstacles to data privacy

Attending the GA4GH meetings and working as part of some of these projects, I am struck by the fact that genomic privacy generally depends on a user trust system, and data protection is focused on the end-user stage. The problem can be illustrated with a simple example:

Every subject relying on genomic analysis must submit a DNA sample along with informed consent, and the data is prepared in several stages:

  1. Sample collection
  2. Sample preparation
  3. Sequencing
  4. Data processing
  5. Data submission
  6. Data access
  7. Reporting

Problem level [1]
The best systems today use a tracking system in which sample collection produces an anonymised ID in step 1. All subsequent steps are therefore detached from the subject’s personal information. However, the personal information is not necessarily the valuable part; the genome sequence is (even if anonymised). Anyone who has access to the DNA sample can easily sequence the genome for less than US$500. Storing and resequencing DNA is actually becoming cheaper than storing the data. The sample freezers may become more valuable to a thief than the data servers.
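The anonymised-ID step can be sketched as a keyed hash. The key, ID format, and function name below are illustrative assumptions, not any project's actual scheme:

```python
import hashlib
import hmac

# Secret key held only by the sample-collection site (invented for illustration).
SITE_KEY = b"replace-with-a-real-secret-key"

def anonymised_id(subject_identifier: str) -> str:
    """Derive a stable pseudonymous sample ID from a subject identifier.

    Using HMAC rather than a plain hash means nobody without the key can
    brute-force tube labels back to a list of known subjects.
    """
    digest = hmac.new(SITE_KEY, subject_identifier.encode(), hashlib.sha256)
    return "S-" + digest.hexdigest()[:12]

tube_label = anonymised_id("patient-0042")  # same input always yields same label
print(tube_label)
```

As the text notes, this only detaches the paperwork: whoever holds the physical sample still holds the genome itself.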

Problem level [2-3]
Sample preparation and sequencing carry the same risk as step 1. However, the sample is now likely out of the hands of the primary person responsible. It will most likely be in a large-scale sequencing facility. Keep in mind that, for measurement accuracy, of a 100 ng DNA library roughly 50 ng is sequenced and 50 ng is thrown away (often, but not always).

Problem level [4]
Data processing will become more routine over time. The large-scale sequencing projects all follow a strict analysis pipeline, but since a large majority of processing today is for clinical diagnosis, at some stage a researcher will be required to do custom analysis on an individual sample. This person is likely to have unrestricted access to all database samples. The administrator of the whole pipeline also has unrestricted access to all data.

Problem level [5]
The data submission level will again consist of anonymous subject IDs; however, it will contain whole genome sequences (or processed variant-call datasets). This is probably the best stage for an opportunist to make a copy: the dataset is reduced to the key information and ready for downstream analysis.

Problem level [6]
Data access is the step on which nearly every genomic data protection process is focused. It is a logical starting position, since this is the stage at which researchers require permissions to access large amounts of data for research purposes. It carries the additional risk that other medical data, typically clinical phenotype data, is usually also present.

Problem level [7]
Once a candidate genomic determinant for clinical diagnosis is established, the researcher will complete a reporting procedure. One would imagine that this is a clean, automated process. However, it is very common for researchers and clinicians to simply email back and forth about very sensitive information. This is understandable, as a patient’s life can often be saved by a rapid diagnosis.

However, the facts should be clearly stated. Plain-text emails, SMS, and other types of communication are collected routinely via national surveillance. This is not to be dismissed as conspiracy theory: it is a fact that your private information may be collected without your consent. At the same time, you will get more expertise in your medical treatment when physicians can communicate with their colleagues via email, etc.

Problem summary:
Nearly all privacy protocols today are focused on step 6. Anyone interested in large-scale medical espionage will focus on any of the other, much more readily available, steps 1-5. Furthermore, data access at step 6 can be restricted to trusted researchers, but there should be no confusion - humans can always find methods for exporting data from protected access portals. While the privilege of data access is restricted, it is essentially based on trust, and recruited patient participants should not be led to believe otherwise. Some very sophisticated methods allow for analysis of encrypted data, but these are not widely used today and will not be able to replace all of the required methods in the near future.
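As a toy sketch of what "analysis of encrypted data" can mean, here is additive secret sharing, one of the simplest building blocks of secure multi-party computation: each participant splits a private allele count into two random shares held by non-colluding servers, which can jointly compute the cohort total without either server seeing any individual value. The counts are invented for illustration.

```python
import random

MODULUS = 2**32  # shares live in a fixed modular ring

def split(value):
    """Split a private value into two random shares that sum to it mod MODULUS."""
    share_a = random.randrange(MODULUS)
    share_b = (value - share_a) % MODULUS
    return share_a, share_b

# Each participant's private risk-allele count (0, 1, or 2 copies; invented).
private_counts = [2, 0, 1, 1, 2, 0, 1]

shares = [split(v) for v in private_counts]
server_a = [a for a, _ in shares]  # server A only ever sees these
server_b = [b for _, b in shares]  # server B only ever sees these

# Each server sums its own shares locally; combining the two partial sums
# reveals the cohort aggregate but never any individual's count.
total = (sum(server_a) + sum(server_b)) % MODULUS
print(total)  # 7 == sum(private_counts)
```

Real deployments layer far more on top (malicious-security protocols, homomorphic encryption, differential privacy), which is precisely why such methods cannot yet replace every step of the pipeline above.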

Future methods

In addition to the highly commendable initiatives for genomic data sharing today, there are several options that can be implemented in the future.

To be finished..

One may argue that the proposed danger is hyperbolic; that it is a subject for intellectual property rather than politics. When once ineptly accused of making a topic political, Christopher Hitchens extemporaneously rebutted with an apt analogy to the nuclear threat of the time (1984), arguing that some topics are explicitly political whether or not others recognise the facts:

“As we sit here, everyone in this room has been made into a front-line soldier; the nuclear age means that we are all conscripted; we don’t have the right to conscientious objection anymore. We are on the front-line while the soldiers are in the bunkers. That is being done to us whether we like it or not. We can’t then complain of those that are objecting to it that they are politicising it. The politicisation has been done. We are all conscripted. We might as well be sitting here in uniform.” [12]

While the comparison to a nuclear détente is an exaggeration, this problem is not simply a matter of consumer choice. It is difficult to quantify the dangers even amongst experts in human genomics; not even the unit of measurement needed to quantify the risk is known.


[1] WannaCry ransomware attack.

[2] The Oxford Statement on the International Law Protections Against Cyber Operations Targeting the Health Care Sector

[3] The Second Oxford Statement on International Law Protections of the Healthcare Sector During Covid-19: Safeguarding Vaccine Research

[4] Former University Of Florida Researcher Indicted For Scheme To Defraud National Institutes Of Health And University Of Florida

[5] Hospital researcher sentenced to prison for conspiring to steal trade secrets, sell them in China

[6] Coronavirus: North Korea and Russia hackers ‘targeting vaccine’.

[7] N. Korea attempted to steal COVID-19 vaccine, treatment technology via hacking: NIS

[8] North Korea accused of hacking Pfizer for Covid-19 vaccine data.

[9] Coronavirus: Cyber-spies hunt Covid-19 research, US and UK warn

[10] The Cyber Side of Vaccine Nationalism

[11] Race for Coronavirus Vaccine Pits Spy Against Spy

[12] Firing Line episode S0629, recorded on December 11, 1984. Guests: R. Emmett Tyrrell Jr., Christopher Hitchens.