This channel is to support discussion following up on @Sheeri Cabral (Collibra)’s presentation at the monthly meeting. Notes and recording here: https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting#MonthlyTSCmeeting-December8,2022(10amPT) Sheeri will start a Google Doc to document our conclusions and proposal to evolve the spec.
@Sheeri Cabral (Collibra) has joined the channel
@Mandy Chessell has joined the channel
@Maciej Obuchowski has joined the channel
@Michael Robinson has joined the channel
Document is here - https://docs.google.com/document/d/1ysZR13QwDvAiY_rQJedLHpnNn3deQeow7BmCNckd2uM/edit
At this point I only have the notes + recording link 😄
(right now everyone can edit it, if that becomes a problem, we can restrict it).
@Michael Collado has joined the channel
@Daniel Henneberger has joined the channel
I will synthesize what I think are the main points, but please change what I got wrong 😄
*Thread Reply:* thanks for kick-starting / driving this @Sheeri Cabral (Collibra) (long overdue)
What are some custom facets people have created, or want to see?
I’ll start! Custom facet of “rows affected” for DML (e.g. rows inserted, rows deleted, rows updated).
(and a grand vision would be to set data quality thresholds against this - an application could warn if a run deviates more than 10% from the mean/median rows affected of previous job runs)
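A minimal sketch of that threshold idea in Python, assuming the rows-affected counts from earlier runs are already available as a plain list. The function name, the 10% default, and the `baseline` parameter are illustrative assumptions, not part of the OpenLineage spec:

```python
from statistics import mean, median


def deviates_from_history(current_rows, previous_rows, threshold=0.10, baseline="mean"):
    """Return True if current_rows differs from the historical baseline
    by more than `threshold` (a fraction, so 0.10 means 10%)."""
    if not previous_rows:
        return False  # no history yet, so nothing to compare against

    # Pick the baseline statistic over earlier runs (hypothetical parameter).
    base = mean(previous_rows) if baseline == "mean" else median(previous_rows)

    if base == 0:
        # Relative deviation is undefined; flag any non-zero count.
        return current_rows != 0

    return abs(current_rows - base) / base > threshold
```

An application could call this after each run and warn when it returns True. The median baseline is less sensitive to one unusually large earlier run.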
*Thread Reply:* Note that it’s possible this has already been implemented - that’s fine! We’re just gathering ideas, nothing is being set in stone here.
*Thread Reply:* you mean output statistics?
*Thread Reply:* Yes! it’s already part of output statistics. I was just giving one example of the kind of ideas we are looking for people to throw out there, whether or not the facet exists already. 😄 it’s brainstorming, no wrong answers. (in this case the answer is “congrats! it’s already there!”)
*Thread Reply:* I’ve always been interested in finding a way for devs to publish business metrics from their apps - i.e., metrics specific to the job or dataset like rowsFiltered or averageValues or something. I think enabling comparison of these metrics with lineage would provide tremendous value to data pipeline engineers
*Thread Reply:* oh yeah. even rowsExamined - e.g. rowsExamined - rowsFiltered = rowsAffected. some databases give those metrics from queries. And this is why they’re optional, BUT if you start a new company whose software analyzes pipeline metrics, those metrics might be required for your software, even though they’re optional in the OpenLineage spec.
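A rough sketch of that arithmetic in Python, treating both metrics as optional. The function name is hypothetical, and the metric names are the ones used in this thread rather than existing spec fields:

```python
from typing import Optional


def rows_affected(rows_examined: Optional[int], rows_filtered: Optional[int]) -> Optional[int]:
    """Derive rowsAffected = rowsExamined - rowsFiltered.

    Returns None when either input is missing, since a producer
    may not report these optional metrics.
    """
    if rows_examined is None or rows_filtered is None:
        return None
    return rows_examined - rows_filtered
```

A consumer that needs the value can treat a None result as "metric not supplied" and decide for itself whether that is an error.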
*Thread Reply:* I’m thinking about some custom facets that would allow interoperability between different frameworks (possibly vendors). for example, I can throw in a link to the Airflow UI for a specific task that we captured metadata for, and another tool (say, a data catalog) can also use my custom facet to help guide users directly to where the process is running. there’s a commercial aspect to this as well, but I think the community aspect is interesting. I don’t think it should be required to comply with the spec, but I’d like to be able to share this custom facet for others to see if it exists and decorate accordingly. does that make sense?
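To make the link idea concrete, here is a hedged sketch in Python of what a producer might emit and how a consumer might read it. The facet name `myorg_taskLink`, its `url` field, and both helper functions are hypothetical; only the `_producer` and `_schemaURL` fields follow the convention that OpenLineage facets carry:

```python
from typing import Optional

FACET_NAME = "myorg_taskLink"  # hypothetical custom facet name with a vendor prefix


def build_task_link_facet(ui_url: str, producer: str, schema_url: str) -> dict:
    """Producer side: wrap a UI link in a custom run facet."""
    return {
        FACET_NAME: {
            "_producer": producer,     # URI identifying who emitted the facet
            "_schemaURL": schema_url,  # URI of the facet's JSON schema
            "url": ui_url,             # hypothetical field: link to the task in the scheduler UI
        }
    }


def find_task_link(run_facets: dict) -> Optional[str]:
    """Consumer side: return the link if the optional facet is present."""
    facet = run_facets.get(FACET_NAME)
    return facet.get("url") if facet else None
```

Because the facet is optional, a consumer such as a data catalog would simply skip the link when `find_task_link` returns None.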
@Anirudh Shrinivason has joined the channel
@Maciej Obuchowski @Jens Pfau I think this is the certification channel?
Some time ago I made a spreadsheet with all the metadata in the OpenLineage spec, with information such as what entity and facet each piece of metadata can be found in. Because sometimes we just need to know - where is that piece of information?
I think this might help and could be a sort of “mapping” for OpenLineage purposes. I think there are Producers and Consumers who don’t know how to generate/consume OpenLineage and put stuff in random places 😄
Would this be useful for folks?
*Thread Reply:* I think it's a good starting point to start mapping scenarios
*Thread Reply:* OK, the spreadsheet is at https://docs.google.com/spreadsheets/d/1Vzf2LSCyISE-t4nd_9by8tGzdAy8m7sAbpyMSUHCl4U/edit?gid=155712120#gid=155712120 and anyone can edit it.
*Thread Reply:* +1 to Maciej. This is great. I made cheat notes on paper when I first went thru the spec a long time ago to create fictitious payloads for testing, and this would have helped immensely. Thank you Sheeri!