HQuery Distributed Queries for Health Data (W1F)

From IIW

Session Topic: hQuery (W1F)

Convener: Justin Richer (@zer0n1ne)

Notes-taker(s): Eve Maler (@xmlgrrl)

Tags for the session - technology discussed/ideas considered:

Discussion notes, key understandings, outstanding questions, observations, and, if appropriate to this discussion: action items, next steps:





Justin advertised this session as trying to "hit the problem with the OAuth hammer"! The goal is to see if OAuth provides the right access management framework for tackling this problem.

Justin introduced the hQuery project (http://projecthquery.org/), part of the Query Health initiative (http://wiki.siframework.org/Query+Health), run by ONC (the Office of the National Coordinator for Health Information Technology). MITRE has come up with a system that allows distributed queries across health information. There are lots of privacy concerns, data security concerns, and distributed access concerns.

The idea is that you put together a query on the client side, e.g. how many people are between the ages of x and y, how many people are taking a certain drug, etc., for epidemiological survey purposes, demographic studies, and similar. This data exists in many places, but we can't just give everyone all the data, even if it's somewhat de-identified. It can be deanonymized too easily.

You want to be able to make the query very specific and complex, across a variety of data sources. The sources need to toss their results back to an aggregator. It should be possible to keep the query results updated too. And it shouldn't be possible to query for data about what amounts to a single known person who can be identified (like Lady Gaga or George Clooney, or even a not-famous person in a small town who's well known there to have HIV). Correlation over subsequent data cuts could deanonymize pretty handily.
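One common mitigation for this kind of re-identification risk is small-cell suppression: refuse to report any cohort count below a threshold, so rare (and hence identifying) combinations never come back exactly. A minimal sketch of the idea, not something specified by hQuery (the threshold and names are illustrative):

```python
# Small-cell suppression: an illustrative mitigation for
# re-identification risk in aggregate health queries.
MIN_CELL_SIZE = 5  # illustrative threshold; real policies vary

def suppress_small_cells(counts):
    """Replace any count below the threshold with None so that
    rare (hence identifying) cohorts are never reported exactly."""
    return {
        cohort: (n if n >= MIN_CELL_SIZE else None)
        for cohort, n in counts.items()
    }

result = suppress_small_cells({"age_30_40_on_drug_x": 120,
                               "age_90_plus_on_drug_x": 2})
```

This doesn't address the correlation attack across subsequent data cuts, which needs policy at the query-authorization level rather than per-result filtering.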

The demographic goal of Query Health is different from patient care systems, which do need individual patient data access. Project hData is more about electronic health care records, and it's using things like UMA to manage access by various healthcare professionals.

For Query Health, you need to think beyond SQL. The client generates a query and sends it across a query network. Is this a map-reduce pattern? This is how they've happened to implement it, but this isn't a formal program decision at this point. The theory is that map-reduce has fewer constraints than a formal query language. The working group is likely to offer several different query paradigms. There's been discussion of rules-based forms and SQL-like forms.
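As a rough illustration of the map-reduce framing (a sketch of the pattern discussed, not hQuery's actual query format), a count-by-criterion query might look like:

```python
# Illustrative map-reduce over local patient records. The record
# fields and cohort key are made up for this sketch.

def map_fn(patient):
    """Emit a (cohort, 1) pair for each patient matching the criteria."""
    if 30 <= patient["age"] <= 40 and "drug_x" in patient["medications"]:
        yield ("age_30_40_on_drug_x", 1)

def reduce_fn(key, values):
    return sum(values)

def run_query(patients, map_fn, reduce_fn):
    groups = {}
    for p in patients:
        for key, value in map_fn(p):
            groups.setdefault(key, []).append(value)
    return {k: reduce_fn(k, vs) for k, vs in groups.items()}

patients = [
    {"age": 35, "medications": ["drug_x"]},
    {"age": 72, "medications": []},
    {"age": 31, "medications": ["drug_x", "drug_y"]},
]
counts = run_query(patients, map_fn, reduce_fn)  # runs at each data source
```

The appeal noted in the session is that arbitrary map functions can express criteria a fixed query language might not anticipate, while only aggregate counts leave the data source.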

The health data world likes to invent their own stuff -- off-the-shelf solutions are often spurned because "people could die". So the architecture to date is somewhat ad hoc with respect to security etc. Justin's work involves applying RESTful patterns and OAuth2 to enable better access control.

The query builders send the query out to known endpoints representing standardized RESTful gateways to data sources. The aggregator knits together all the results, even if they're very disparate in nature. Query results could come back at wildly different times; it can't be synchronous. So it's not a single HTTP transaction or socket connection. It's basically a batch system that's designed to tolerate eventual consistency. There's no blocking or crashing on lack of results from one source, though the querier is informed which sources haven't contributed. Queries can last forever or be time-limited. Eventually, the client gets a result that can be used in writing reports and so on.
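The aggregator's tolerance for stragglers can be sketched as a merge over whatever partial results have arrived, with pending sources reported rather than treated as failures (all names here are assumptions for illustration):

```python
# Sketch of a batch-style aggregator that tolerates missing sources:
# it merges the partial counts received so far and reports which
# gateways have not yet contributed.

def aggregate(expected_sources, received):
    """received maps source id -> {cohort: count} for sources that
    have responded so far; the rest are pending, not errors."""
    totals = {}
    for source, counts in received.items():
        for cohort, n in counts.items():
            totals[cohort] = totals.get(cohort, 0) + n
    pending = sorted(set(expected_sources) - set(received))
    return {"totals": totals, "pending_sources": pending}

report = aggregate(
    expected_sources=["hospital_a", "clinic_b", "lab_c"],
    received={"hospital_a": {"cohort_1": 40},
              "clinic_b": {"cohort_1": 12}},
)
```

Re-running the merge as late results trickle in gives the eventual-consistency behavior described above, without any source blocking the others.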

The trust is managed through joining, e.g., the CDC trust network and living up to their constraints. The authorization server is some trusted party that could live at the gateway level, the query builder/aggregator level, or even a third party that's trusted. The resources are OAuth-protected at the RESTful gateways.
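In OAuth terms, submitting a query to a protected gateway would put the access token (obtained from the trust network's authorization server) in the Authorization header, per the standard bearer-token pattern. A sketch under assumed endpoint and parameter names, not hQuery's actual API:

```python
# Sketch of a query submission to an OAuth-protected RESTful gateway.
# The /queries endpoint and the query body shape are assumptions.

import json

def build_query_request(gateway_url, access_token, query):
    return {
        "method": "POST",
        "url": f"{gateway_url}/queries",   # hypothetical endpoint
        "headers": {
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
        "body": json.dumps(query),
    }

req = build_query_request("https://gateway.example.org",
                          "mF_9.B5f-4.1JqM",
                          {"cohort": "age_30_40_on_drug_x"})
```

The point of the design is that the gateway only has to check the token against its trusted AS; it doesn't need a direct relationship with every query builder.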

Who pays the cost of performing these potentially expensive queries (effectively DoS, whether malicious or accidental)? That's a simple matter of sysadmin. :-)

Likely each organization that runs a query builder/aggregator will offer its own client app.

There's the client, the local access point, the trust network association (AS), and the data/gateways/PRs. Wouldn't the gateway want to audit who gave access (which AS)? This is where the AS/PR separation becomes important. It was noted that UMA formally separates ASs and PRs, and provides a way for them to build mutual trust in the context of a particular user. Or in a purely pairwise fashion, they could use JWT keys and dynamic discovery to bootstrap trust-building.

Query Health should look at the DURSA from NHIN (http://healthit.hhs.gov/portal/server.pt/gateway/PTARGS_0_10731_849891_0_0_18/DRAFT%20NHIN%20Trial%20Implementations%20Production%20DURSA-3.pdf) for trust relationship building.

Or the query builder could naturally serve as a discovery service for finding the relevant data sources. This would be a good idea to consider. Since this network is not meant to be open to the public, there are optimizations that could be applied.

Are there always four parties, or are there multiple ASs and other nested relationships? Would opaque tokens continue to work if you received an opaque token in and need to validate it but don't know which AS issued it? Perhaps the scope in the header could say which AS issued it, by agreement.
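The multiple-AS validation problem can be sketched concretely: with opaque tokens, the gateway needs some out-of-band hint (here modeled as an issuer field sent alongside the token, by agreement) to know which AS to validate against. Everything below is illustrative, not a specified mechanism:

```python
# Sketch of "which AS issued this opaque token?" The gateway keeps a
# registry of trusted ASs and uses a caller-supplied issuer hint to
# pick the right one. Token values and AS URLs are made up.

TRUSTED_AS = {
    "https://as.cdc-trust.example": {"tok-abc": {"scope": "query:demographics"}},
    "https://as.state-net.example": {"tok-xyz": {"scope": "query:meds"}},
}

def validate(token, issuer_hint):
    as_tokens = TRUSTED_AS.get(issuer_hint)
    if as_tokens is None:
        return None              # unknown / untrusted AS
    return as_tokens.get(token)  # None if the AS doesn't recognize it

info = validate("tok-abc", "https://as.cdc-trust.example")
```

In a real deployment the inner lookup would be a call to the AS (token introspection) rather than a local table, but the trust decision shape is the same.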

The counterproposal that has been made is to use persistent TCP connections with direct two-way (mutual) TLS. You could even use SSL JDBC connections! But that seems very tightly coupled. You could have possibly millions of data sources, and the gateways are likely to be run by smaller institutions.

So the AS seems to be an important role, akin to a CA in the "old world". Certificate revocation lists will tend to weigh heavily on the overall cost of the system, as always.

How important is it to make the contractual relationships more flexible? The query builders/aggregators could get no permissions on their own, other than what the client allows them to have. Once there's an authorizing user who operates the client and can accept liability for the connections they're forging, this is where UMA's phase 1 potentially gets more relevant.

If each gateway had its own unique set of scopes governing its data sources, maybe the client would have to ask the query builder to go off and figure out what scopes will be needed, before actually making the query of it.
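That pre-flight step could be sketched as a scope-discovery lookup: the client asks which scopes the target gateways will require for the data categories in its query before requesting a token. The per-gateway scope map and names below are invented for illustration:

```python
# Sketch of pre-flight scope discovery across gateways with
# heterogeneous scope vocabularies. All scope strings are made up.

GATEWAY_SCOPES = {
    "gateway_a": {"demographics": "a:demo", "medications": "a:meds"},
    "gateway_b": {"demographics": "b:population"},
}

def scopes_for_query(gateways, data_categories):
    """Collect the union of scopes needed across gateways; skip
    gateways that don't expose a requested category."""
    needed = set()
    for gw in gateways:
        for cat in data_categories:
            scope = GATEWAY_SCOPES.get(gw, {}).get(cat)
            if scope:
                needed.add(scope)
    return sorted(needed)

scopes = scopes_for_query(["gateway_a", "gateway_b"], ["demographics"])
```

The client would then request a token carrying that scope set from the AS before fanning the query out.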

Can the client simply be an OAuth client?

The personal health "bank account" model, like Google Health's model, is starting to gain favor with HHS. This is more like hData (for which see a potential protection model here: http://kantarainitiative.org/confluence/display/uma/hdata_scenario). If you used these as data sources, what do you do about the duplicate-record problem? Actually, you have this problem regardless. "GIGO." If you could fix this, you'd be linking data records, which violates privacy.

The health vaults are often operated as cloud services. Some of these are state-run.

It might be worth looking at Medify.com (https://www.medify.com/), which is taking something like this public-data approach, for individual people.

The gateway itself has to be trusted to see the raw data from the data sources, so there are sandboxing requirements even at that level, and potentially even something like certification for those parties. If you accept such risk, there's a business opportunity in being paid to take on that risk. Thus, the gateway is where the business model opportunity lies.

If SWD (Simple Web Discovery) gets used, it would be for initial bootstrapping, but then there are "semantic" aspects of discovery around finding which gateways offer which data sources that would be needed to satisfy the query.

The emphasis in this whole project is IETF-like: rough consensus and running code. There's also the NHIN Direct project. MITRE and Microsoft are both involved in the Query Health initiative, which is helpful for ensuring we don't have totally opposing approaches; likely a blend would be best. hQuery is being run as an open group, so interested parties can get involved pretty easily. There's a F2F being held in DC this week.