Slack Export - #spec-compliance

Julien Le Dem (julien@apache.org)

2023-01-12 08:55:50

@Julien Le Dem has joined the channel

Julien Le Dem (julien@apache.org)

2023-01-12 09:00:07

This channel is to support discussion following up on @Sheeri Cabral (Collibra)’s presentation at the monthly meeting. Notes and recording here: https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting#MonthlyTSCmeeting-December8,2022(10amPT)

Sheeri will start a Google Doc to document our conclusions and proposal to evolve the spec.

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)

2023-01-12 09:00:32

@Sheeri Cabral (Collibra) has joined the channel

Mandy Chessell (mandy.e.chessell@gmail.com)

2023-01-12 09:00:33

@Mandy Chessell has joined the channel

Ernie Ostic (ernie.ostic@getmanta.com)

2023-01-12 09:00:33

@Ernie Ostic has joined the channel

Maciej Obuchowski (maciej.obuchowski@getindata.com)

2023-01-12 09:13:48

@Maciej Obuchowski has joined the channel

Michael Robinson (michael.robinson@astronomer.io)

2023-01-12 11:40:03

@Michael Robinson has joined the channel

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)

2023-01-12 12:47:20

Document is here - https://docs.google.com/document/d/1ysZR13QwDvAiY_rQJedLHpnNn3deQeow7BmCNckd2uM/edit

At this point I only have the notes + recording link 😄

:gratitude_thank_you: Michael Collado, Will Johnson

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)

2023-01-12 12:47:39

(right now everyone can edit it, if that becomes a problem, we can restrict it).

Harel Shein (harel.shein@gmail.com)

2023-01-12 13:40:17

@Harel Shein has joined the channel

Laurent Paris (laurent@datakin.com)

2023-01-12 14:00:16

@Laurent Paris has joined the channel

Michael Collado (collado.mike@gmail.com)

2023-01-12 14:03:24

@Michael Collado has joined the channel

Daniel Henneberger (me@danielhenneberger.com)

2023-01-12 14:05:11

@Daniel Henneberger has joined the channel

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)

2023-01-12 14:05:59

I will synthesize what I think are the main points, but please change what I got wrong 😄

🙌 Willy Lulciuc

Willy Lulciuc (willy@datakin.com)

2023-01-13 18:37:43

*Thread Reply:* thanks for kick-starting / driving this @Sheeri Cabral (Collibra) (long overdue)

Willy Lulciuc (willy@datakin.com)

2023-01-12 14:06:10

@Willy Lulciuc has joined the channel

Kengo Seki (sekikn@gmail.com)

2023-01-14 19:29:10

@Kengo Seki has joined the channel

Ross Turk (ross@datakin.com)

2023-01-15 10:49:44

@Ross Turk has joined the channel

Orbit

2023-01-24 14:34:11

@Orbit has joined the channel

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)

2023-02-09 14:01:39

What are some custom facets people have created, or want to see?

I’ll start! Custom facet of “rows affected” for DML (e.g. rows inserted, rows deleted, rows updated).

(and a grand vision would be to set data quality thresholds against this - an application could warn if a run deviates more than 10% from the mean/median rows affected from previous run jobs)
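A minimal sketch of the threshold idea above, assuming the application already has the row counts from previous runs in a list. The function name `deviates`, its parameters, and the sample numbers are hypothetical and only illustrate the 10% check against a mean or median baseline.

```python
from statistics import mean, median

def deviates(current, history, threshold=0.10, baseline=median):
    """Return True if `current` differs from the baseline of `history`
    by more than `threshold` (a fraction, so 0.10 means 10%)."""
    if not history:
        return False  # nothing to compare against yet
    base = baseline(history)
    if base == 0:
        return current != 0  # avoid dividing by zero
    return abs(current - base) / base > threshold

# Hypothetical row counts from earlier runs of the same job.
previous_row_counts = [1000, 1040, 980, 1010]

print(deviates(1200, previous_row_counts))                 # True (about 19% above the median)
print(deviates(1020, previous_row_counts))                 # False (about 1.5% above the median)
print(deviates(1200, previous_row_counts, baseline=mean))  # True
```

Passing `baseline=mean` or `baseline=median` switches between the two baselines mentioned in the message.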

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)

2023-02-09 14:02:44

*Thread Reply:* Note that it’s possible this has already been implemented - that’s fine! We’re just gathering ideas, nothing is being set in stone here.

Michael Collado (collado.mike@gmail.com)

2023-02-09 14:03:04

*Thread Reply:* you mean output statistics?

https://github.com/OpenLineage/OpenLineage/blob/main/spec/facets/OutputStatisticsOutputDatasetFacet.json (OutputStatisticsOutputDatasetFacet.json)

<pre><code>{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://openlineage.io/spec/facets/1-0-0/OutputStatisticsOutputDatasetFacet.json",
  "$defs": {
    "OutputStatisticsOutputDatasetFacet": {
      "allOf": [
        {
          "$ref": "https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/OutputDatasetFacet"
        },
        {
          "type": "object",
          "properties": {
            "rowCount": {
              "description": "The number of rows written to the dataset",
              "type": "integer"
            },
            "size": {
              "description": "The size in bytes written to the dataset",
              "type": "integer"
            }
          },
          "required": [ "rowCount" ]
        }
      ]
    }
  },
  "type": "object",
  "properties": {
    "outputStatistics": {
      "$ref": "#/$defs/OutputStatisticsOutputDatasetFacet"
    }
  }
}</code></pre>
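As a rough illustration of what the schema quoted above asks for, here is a hand-written check of the two fields it lists: `rowCount` (required) and `size` (optional). The helper `check_output_statistics` is hypothetical and is not a full JSON Schema validator. It ignores whatever fields the referenced `OutputDatasetFacet` base definition adds, since those are not shown in the thread.

```python
def check_output_statistics(facet: dict) -> list:
    """Return a list of problems; an empty list means the two quoted
    fields look valid. Simplified: only plain Python ints are accepted."""
    problems = []
    if "rowCount" not in facet:
        problems.append("rowCount is required")
    for name in ("rowCount", "size"):
        if name in facet:
            value = facet[name]
            # bool is a subclass of int in Python, so reject it explicitly
            if isinstance(value, bool) or not isinstance(value, int):
                problems.append(f"{name} must be an integer")
    return problems

print(check_output_statistics({"rowCount": 1500, "size": 204800}))  # []
print(check_output_statistics({"size": 204800}))  # ['rowCount is required']
```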

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)

2023-02-09 14:04:07

*Thread Reply:* Yes! It’s already part of output statistics. I was just giving one example of the kind of ideas we are looking for people to throw out there, whether or not the facet exists already. 😄 It’s brainstorming, no wrong answers. (In this case the answer is “congrats! it’s already there!”)

👏 Michael Collado

Michael Collado (collado.mike@gmail.com)

2023-02-09 14:14:42

*Thread Reply:* I’ve always been interested in finding a way for devs to publish business metrics from their apps - i.e., metrics specific to the job or dataset like rowsFiltered or averageValues or something. I think enabling comparison of these metrics with lineage would provide tremendous value to data pipeline engineers

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)

2023-02-09 14:16:53

*Thread Reply:* oh yeah, even rowsExamined - e.g. rowsExamined - rowsFiltered = rowsAffected. Some databases give those metrics from queries. (And this is why they’re optional, BUT if you start a new company whose software analyzes pipeline metrics, they might be required for your software, even though they’re optional in the OpenLineage spec.)

Harel Shein (harel.shein@gmail.com)

2023-02-10 10:07:13

*Thread Reply:* I’m thinking about some custom facets that would allow interoperability between different frameworks (possibly vendors). for example, I can throw in a link to the Airflow UI for a specific task that we captured metadata for, and another tool (say, a data catalog) can also use my custom facet to help guide users directly to where the process is running. there’s a commercial aspect to this as well, but I think the community aspect is interesting. I don’t think it should be required to comply with the spec, but I’d like to be able to share this custom facet for others to see if it exists and decorate accordingly. does that make sense?
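One way the shared-link idea above could look, as a sketch only. The facet key `example_sourceLink`, its `url` field, and both helper functions are hypothetical, since the thread does not define any of them. The point is that a producer attaches the link and a consumer treats it as optional, using it only if it is present.

```python
from typing import Optional

# Hypothetical custom facet key; not defined in the thread or taken from the spec.
FACET_KEY = "example_sourceLink"

def make_run_facets(task_url: str) -> dict:
    """Producer side: attach a link to where the task is running."""
    return {FACET_KEY: {"url": task_url}}

def find_source_link(run_facets: dict) -> Optional[str]:
    """Consumer side: return the link if the facet exists, else None."""
    facet = run_facets.get(FACET_KEY)
    if isinstance(facet, dict):
        url = facet.get("url")
        if isinstance(url, str):
            return url
    return None

# Placeholder URL, for illustration only.
facets = make_run_facets("https://airflow.example.com/placeholder-task-page")
print(find_source_link(facets))  # https://airflow.example.com/placeholder-task-page
print(find_source_link({}))      # None
```

Because the consumer falls back to `None`, a tool such as a data catalog can show the link when a producer supplies it without requiring it for compliance.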

Mike Dillion (mike.dillion@gmail.com)

2023-02-11 18:51:45

@Mike Dillion has joined the channel

Mahesh Gawde (mahesh.gawde@originml.ai)

2023-02-28 05:53:27

@Mahesh Gawde has joined the channel

Thijs Koot (thijs.koot@gmail.com)

2023-02-28 08:36:33

@Thijs Koot has joined the channel

Anirudh Shrinivason (anirudh.shrinivason@grabtaxi.com)

2023-03-09 14:12:58

@Anirudh Shrinivason has joined the channel

jrich (jasonrich85@icloud.com)

2023-03-10 14:52:40

@jrich has joined the channel

Suraj Gupta (suraj.gupta@atlan.com)

2023-06-01 13:51:29

@Suraj Gupta has joined the channel

Yuanli Wang (yuanliw@bu.edu)

2023-07-12 17:18:25

@Yuanli Wang has joined the channel

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)

2024-07-11 16:34:43

@Maciej Obuchowski @Jens Pfau I think this is the certification channel?

👍 Maciej Obuchowski

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)

2024-07-11 16:36:32

Some time ago I made a spreadsheet with all the metadata in the OpenLineage spec, with information such as what entity and facet each piece of metadata can be found in. Because sometimes we just need to know - where is that piece of information?

I think this might help and could be a sort of “mapping” for OpenLineage purposes. I think there are Producers and Consumers who don’t know how to generate/consume OpenLineage and put stuff in random places 😄

Screenshot 2024-07-11 at 4.36.25 PM.png

👍 Maciej Obuchowski

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)

2024-07-11 16:36:52

Would this be useful for folks?

Maciej Obuchowski (maciej.obuchowski@getindata.com)

2024-07-12 04:10:24

*Thread Reply:* I think it's a good starting point to start mapping scenarios

✅ Sheeri Cabral (Collibra)

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)

2024-07-12 09:22:22

*Thread Reply:* OK, the spreadsheet is at https://docs.google.com/spreadsheets/d/1Vzf2LSCyISE-t4nd_9by8tGzdAy8m7sAbpyMSUHCl4U/edit?gid=155712120#gid=155712120 and anyone can edit it.

Ernie Ostic (ernie.ostic@getmanta.com)

2024-07-12 09:26:38

*Thread Reply:* +1 to Maciej. This is great. I made cheat notes on paper when I first went thru the spec a long time ago to create fictitious payloads for testing, and this would have helped immensely. Thank you Sheeri!

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)

2024-07-11 16:36:58

(it’s not up to date)
