@Maciej Obuchowski has joined the channel
@Paweł Leszczyński has joined the channel
@Jakub Dardziński has joined the channel
@Michael Robinson has joined the channel
https://github.com/OpenLineage/OpenLineage/pull/2260 fun PR incoming
*Thread Reply:* hey look, more fun https://github.com/OpenLineage/OpenLineage/pull/2263
*Thread Reply:* nice to have fun with you Jakub
*Thread Reply:* Can't wait to see it on the 1st January.
*Thread Reply:* Ain’t no party like a dev ex improvement party
*Thread Reply:* Gentoo installation party is in similar category of fun
@Paweł Leszczyński approved PR #2661 with minor comments, I think the enum defined in the db layer is one comment we’ll need to address before merging; otherwise solid work dude 👌
> _Minor_: We can consider defining a `_run_state` column and eventually dropping `event_type`. That is, we can consider columns prefixed with `_` to be "remappings" of OL properties to Marquez.
-> didn't get this one. Is this for now, or some future plans?
*Thread Reply:* I will then replace enum with string
also, what about this PR? https://github.com/MarquezProject/marquez/pull/2654
*Thread Reply:* this is the next to go
*Thread Reply:* and i consider it ready
*Thread Reply:* Then we have a draft one with streaming support https://github.com/MarquezProject/marquez/pull/2682/files -> which has an integration test of lineage endpoint working for streaming jobs
*Thread Reply:* I still need to work on #2682 but you can review #2654. once you get some sleep, of course 😉
Got the doc + poc for hook-level coverage: https://docs.google.com/document/d/1q0shiUxopASO8glgMqjDn89xigJnGrQuBMbcRdolUdk/edit?usp=sharing
*Thread Reply:* did you check if `LineageCollector` is instantiated once per process?
*Thread Reply:* Using it only via `get_hook_lineage_collector`
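For reference, the usual shape of such an accessor, as a minimal sketch (assuming `LineageCollector` is the class from the PoC doc, not actual Airflow code):

```python
_collector = None


def get_hook_lineage_collector():
    # Module-level singleton: the collector is instantiated once per process,
    # as long as every caller goes through this accessor.
    global _collector
    if _collector is None:
        _collector = LineageCollector()  # hypothetical class from the PoC
    return _collector
```

Note this only guarantees a single instance if nothing constructs `LineageCollector()` directly.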
Anyone have thoughts about how to address the question about “pain points” here? https://openlineage.slack.com/archives/C01CK9T7HKR/p1700064564825909. (Listing pros is easy — it’s the cons we don’t have boilerplate for)
*Thread Reply:* Maybe something like “OL has many desirable integrations, including a best-in-class Spark integration, but it’s like any other open standard in that it requires contributions in order to approach total coverage. Thankfully, we have many active contributors, and integrations are being added or improved upon all the time.”
*Thread Reply:* Maybe rephrase pain points to "something we're not actively focusing on"
Apparently an admin can view a Slack archive at any time at this URL: https://openlineage.slack.com/services/export. Only public channels are available, though.
*Thread Reply:* you are now admin
have we discussed adding column level lineage support to Airflow? https://marquezproject.slack.com/archives/C01E8MQGJP7/p1700087438599279?thread_ts=1700084629.245949&cid=C01E8MQGJP7
*Thread Reply:* we have it in SQL operators
*Thread Reply:* OOh any docs / code? or if you’d like to respond in the MQZ slack 🙏
*Thread Reply:* I’ll reply there
Any opinions about a free task management alternative to the free version of Notion (10-person limit)? Looking at Trello for keeping track of talks.
*Thread Reply:* What about GitHub projects?
*Thread Reply:* Projects is the way to go, thanks
*Thread Reply:* Set up a Projects board. New projects are private by default. We could make it public. The one thing that’s missing that we could use is a built-in date field for alerting about upcoming deadlines…
worlds are colliding: 6point6 has been acquired by Accenture
*Thread Reply:* https://newsroom.accenture.com/news/2023/accenture-to-expand-government-transformation-capabilities-in-the-uk-with-acquisition-of-6point6
*Thread Reply:* We should sell OL to governments
*Thread Reply:* we may have to rebrand to ClosedLineage
*Thread Reply:* not in this way; just emit any event second time to secret NSA endpoint
*Thread Reply:* we would need to improve our stock photo game
CFP for Berlin Buzzwords went up: https://2024.berlinbuzzwords.de/call-for-papers/ Still over 3 months to submit 🙂
*Thread Reply:* thanks, updated the talks board
*Thread Reply:* https://github.com/orgs/OpenLineage/projects/4/views/1
*Thread Reply:* I'm in, will think what to talk about and appreciate any advice 🙂
just searching for OpenLineage in the Datahub code base. They have an “interesting” approach? https://github.com/datahub-project/datahub/blob/2b0811b9875d7d7ea11fb01d0157a21fdd[…]odules/airflow-plugin/src/datahub_airflow_plugin/_extractors.py
*Thread Reply:* It looks like the datahub airflow plugin uses OL, but turns it off:
https://github.com/datahub-project/datahub/blob/2b0811b9875d7d7ea11fb01d0157a21fdd67f020/docs/lineage/airflow.md
> `disable_openlineage_plugin` | true | Disable the OpenLineage plugin to avoid duplicative processing.
They reuse the extractors but then “patch” the behavior.
*Thread Reply:* Of course this approach will need changing again with AF 2.7
*Thread Reply:* It looks like we can possibly learn from their approach in SQL parsing: https://datahubproject.io/docs/lineage/airflow/#automatic-lineage-extraction
*Thread Reply:* what's that approach? I only know they have been claiming best SQL parsing capabilities
*Thread Reply:* I haven’t looked in the details but I’m assuming it is in this repo. (my comment is entirely based on the claim here)
*Thread Reply:* https://www.acryldata.io/blog/extracting-column-level-lineage-from-sql
-> The interesting difference is that in order to find table schemas, they use their data catalog to evaluate column-level lineage instead of doing this on the client side.
My understanding by example: if you do `create table x as select * from y`, you need to resolve `*` to know the column-level lineage. Our approach is to do that on the client side, probably with an extra call to the database. Their approach is to do that based on the data catalog information.
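To make the `select *` point concrete, a tiny sketch (hypothetical schema, not any parser's real API) of why the schema of `y` has to come from somewhere, be it the database or a catalog:

```python
# Schema of y, fetched either from the database (client side) or from a
# data catalog (DataHub's approach). Assumed values, for illustration only.
schema_y = ["id", "name", "created_at"]

# Column-level lineage for: create table x as select * from y
column_lineage = {col: [("y", col)] for col in schema_y}
# {'id': [('y', 'id')], 'name': [('y', 'name')], 'created_at': [('y', 'created_at')]}
```

Without `schema_y`, a parser can only report table-level lineage (x depends on y).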
I’m off on vacation. See you in a week
Maybe move today's meeting earlier, since no one from the west coast is joining? @Harel Shein
*Thread Reply:* Ah! That would have been a good idea, but I can’t :(
*Thread Reply:* Do you prefer an earlier meeting tomorrow?
*Thread Reply:* maybe let's keep today's meeting then
The full project history is now available at https://openlineage.github.io/slack-archives/. Check it out!
*Thread Reply:* tfw you thought the scrollback was gone 😳
*Thread Reply:* slack has a good activation story, I wonder how much longer they can keep this up for
*Thread Reply:* always nice to be reminded that there are no actual incremental costs on their end
*Thread Reply:* I guess it’s the difference between storing your data in memory vs. on a glacier 🧊
*Thread Reply:* ah yes surely there is some tiering going on
*Thread Reply:* i might get to marquez slack/PRs today, but most likely tmr morning
*Thread Reply:* If you’re looking for priorities, it would be really great if you could give feedback on one of @Paweł Leszczyński streaming support PRs today
*Thread Reply:* ok, I’ll get to the streaming PR first
*Thread Reply:* FYI, the namespace filtering is a good idea, just needs some feedback on impl / naming
Jens would like to know if there’s anything we want included in the welcome portion of the slide deck. Suggestions? (Aside from the usual links)
@Paweł Leszczyński I reviewed your PR today (mainly the logic on versioning for streaming jobs); here is the main versioning limitation for jobs: a new `JobVersion` is created only when a job run completes or fails (or is in the done state); that is, we don't know if we have received all the input/output datasets, so we hold off on creating a new job version until we do.
For streaming, we’ll need to create a job version on start. Do we assume we have all input/output datasets associated with the streaming job? Does OpenLineage guarantee this to be the case for streaming jobs? Having versioning logic for batch vs streaming is a reasonable solution, just want to clarify
*Thread Reply:* yes, the logic adds a distinction in how to create a job version per processing type. For streaming, I think it makes more sense to create it at the beginning. Then, within other events of the same run, we need to check if the version has changed, and create a new version in that case
*Thread Reply:* would we want to use the same versioning func `Utils.newJobVersionFor()` for streaming? That is, should we assume the input/output datasets contained within the OL event to be the “current” set for the streaming job?
*Thread Reply:* that is,
2 input streams, 1 output stream (version 1)
then, 1 input stream, 2 output streams (version 2)
...
*Thread Reply:* but what about the case when the in/out streams are not present:
1 input stream, 2 output streams (version 2)
then, 1 input stream, 0 output streams (version 3)
...
*Thread Reply:* The meaning for the streaming events should be slightly different.
For batch, input and output datasets are cumulative across all the events. If we have an event with output datasets A + B, then another event with output datasets B + C, then we assume the job has output datasets A + B + C.
For streaming, we may have a streaming job that for a week was reading data from topics A + B, and then in the next week it was reading from B + C. I think this should be reflected in different job versions. Making it cumulative for jobs that run for several weeks does not make that much sense to me. The problem here is: what happens if a producer sends some extra events with no input/output datasets specified, like the amount of bytes read? Shall we treat it as a new version? If not, why not?
This part is missing in the PR, and our Flink integration always sends all the input & output datasets. I can add extra logic that will prevent creating a new job version if an event has no input nor output datasets. However, I can't see any clean and generic solution to this.
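A minimal sketch of the decision rule being discussed (illustrative Python, not the PR's actual code):

```python
def needs_new_job_version(current_io, event_inputs, event_outputs):
    """current_io is the (inputs, outputs) pair of the latest job version."""
    # Events carrying no datasets at all (e.g. only a bytes-read metric)
    # never trigger a new version.
    if not event_inputs and not event_outputs:
        return False
    # Otherwise, a new version is needed only when the I/O set itself changed,
    # e.g. topics A + B last week vs. B + C this week.
    return (frozenset(event_inputs), frozenset(event_outputs)) != current_io
```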
*Thread Reply:* > The problem here is: what happens if a producer sends some extra events with no input/output datasets specified, like the amount of bytes read? Shall we treat it as a new version? If not, why not?
We can view the bytes read as additional metadata about the job's inputs/outputs that wouldn't trigger a new version (for the job or dataset). I would associate the bytes with the current dataset version and sum them up (I've read `X` bytes from dataset version `D`); you can also view tags in a similar way. In our current versioning logic for datasets, we create a new dataset version when a job completes. I think we'll want to do something similar for streaming jobs; that is, when `X` bytes are written to a given dataset, that would trigger a new version.
*Thread Reply:* > I can add extra logic that will prevent creating a new job version if an event has no input nor output datasets
Yes, if no in/out datasets are present, then I wouldn't create a new job version. @Julien Le Dem opened an issue a while back about this: https://github.com/MarquezProject/marquez/issues/1513. That is, there's a difference between an empty set `[]` and `null`.
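The `[]` vs `null` distinction in Python terms, as a sketch of the semantics (not Marquez code):

```python
event = {"eventType": "RUNNING", "inputs": None}  # hypothetical OL event

inputs = event.get("inputs")
if inputs is None:
    ...  # null: inputs were not reported, keep the current job version
elif inputs == []:
    ...  # []: the job is known to have no inputs, which IS versioning info
```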
*Thread Reply:* > This part is missing in the PR and our Flink integration always sends all the input & output datasets
This is very important to note in the code and/or API docs
*Thread Reply:* Sure we should. Just wanted to make sure this is the way we want to go.
*Thread Reply:* @Willy Lulciuc did you have a chance to look at this as well: https://github.com/MarquezProject/marquez/pull/2654? This should be merged before streaming support, I believe.
*Thread Reply:* ahh sorry, I hadn’t realized they were related / dependent on one another. sure I’ll give the PR a pass
*Thread Reply:* I looked into your comments and found them, as always, really useful. I introduced changes based on most of them. Please take a look at my responses within the `Job` model class. I think there is one issue we still need to discuss.
What to do with the existing `type` field? I would opt for deprecating it, as within the introduced job facet, the notion of `jobType` stands for `QUERY|COMMAND|DAG|TASK|JOB|MODEL`, while `processingType` determines whether a job is batch or streaming.
One solution I see is deprecating `type` and introducing a `JobLabels` class as a property within `Job`, with fields like `jobType`, `processingType`, `integration`.
Another would be to send `processingType` within the existing `type` field. This would mimic the existing API, but require further work. The disadvantage is that we still have a mismatch between the job type in Marquez and the OpenLineage spec.
I would opt for (2), but (1) works for me as well.
I’m working on a redesign of the Ecosystem page for a more modern, user-friendly layout. It’s a work in progress, and feedback is welcome: https://github.com/OpenLineage/docs/pull/258.
Can someone count the folks in the room please? Can’t see anyone other than the speaker
@Michael Robinson can you hear the questions?
*Thread Reply:* I could hear all but one of the questions after the first talk
*Thread Reply:* Oh then it's better than I thought
I just had a lovely conversation at reinvent with the CTO of dbt, Connor, and didn’t even know it was him until the end 🤯
Congrats on a great event!
*Thread Reply:* Yeah it was pretty nice 🙂 A lot of good discussions with Google people. Also Jarek Potiuk was there
*Thread Reply:* I think it won't be the last Warsaw OpenLineage meetup
https://openlineage.slack.com/archives/C01CK9T7HKR/p1701288000527449 putting it here. I don't feel like I'm the best person to answer, but I feel like the operational lineage we're trying to provide is the thing
created a project for Blog post ideas: https://github.com/orgs/OpenLineage/projects/5/views/1
Release update: we're due for an OpenLineage release and overdue for a Marquez release. As tomorrow, the first, is a Friday, we should wait until Monday at the earliest. I'm planning to open a vote for an OL release then, but the Marquez build is red, so I'm holding off on a Marquez release for the time being.
*Thread Reply:* I can address the red CI status, it's bc we're seeing issues publishing our snapshots
*Thread Reply:* I think we should release Marquez on Mon. as well
*Thread Reply:* I want to get this https://github.com/OpenLineage/OpenLineage/pull/2284 into OL release
it would be interesting to use this comparison as a learning opportunity (improve docs, etc.): https://blog.singleorigin.tech/race-to-the-finish-line-age/
or rather, use the format for comparing OL with other tools 😉
*Thread Reply:* It would be nice to have something like this (I would want it to be a little more even-handed, though). It will be interesting to see if they will ever update this now that there’s automated lineage from Airflow supported by OL
Review needed of the newsletter section on Airflow Provider progress @Jakub Dardziński @Maciej Obuchowski when you have a moment. It will ship by 5 PM ET today, fyi. Already shared it with you. Thanks!
*Thread Reply:* Thanks @Jakub Dardziński
They finally uploaded the OpenLineage Airflow Summit videos to the Airflow channel on YT: https://www.youtube.com/@ApacheAirflow/videos
On Monday I’m meeting with someone at Confluent about organizing a meetup in London in January. I’m thinking I’ll suggest Jan. 24 or 31 as mid-week days work better and folks need time to come back from the vacation. If you have thoughts on this, would you please let me know by 10:00 am ET on Monday? Also, standup will be happening before the meeting — perhaps we can discuss it then. @Harel Shein
*Thread Reply:* Confluent says January 31st will work for them for a London meetup, and they’ll be providing a speaker as well. Is it safe to firm this up with them?
*Thread Reply:* I'd say yes; eventually, if Maciej doesn't get a new passport by then, I can speak
*Thread Reply:* I already got the photos 😂
*Thread Reply:* you gotta share them
*Thread Reply:* Also apparently it's possible to get temporary passport at airport in 15 minutes
*Thread Reply:* How civilized...
*Thread Reply:* you can get it at the Warsaw airport just like a last-minute passport, and it costs barely anything (30 PLN, which is ~7-8 USD)
*Thread Reply:* yeah, many people are surprised how developed our public service may be
*Thread Reply:* tbh it's always random, can be good can be shit 🙂
*Thread Reply:* lately it's definitely been better than 10 years ago tho
https://blog.datahubproject.io/extracting-column-level-lineage-from-sql-779b8ce17567 https://datastation.multiprocess.io/blog/2022-04-11-sql-parsers.html
*Thread Reply:* [6] Note that this isn’t a fully fair comparison, since the DataHub one had access to the underlying schemas whereas the other parsers don’t accept that information. 🙂
*Thread Reply:* I’m not sure about the methodology, but these numbers are pretty significant
*Thread Reply:* We tested on a corpus of ~7000 BigQuery SELECT statements and ~2000 CREATE TABLE ... AS SELECT (CTAS) statements.⁶
*Thread Reply:* "More doctors smoke Camels than any other cigarette" 😉 If you test on BigQuery, you will not get comparable results for Snowflake, for example.
Wondering if we can do anything about this. We could write a blog post on lineage extraction from Snowflake SQL queries. This is something we spent time on, and possibly we support dialect-specific queries that others don't.
*Thread Reply:* it all comes down to the question of whether we should start publishing comparisons
*Thread Reply:* We can also accept schema information in our SQL lineage parser. Actually, this would have been a good idea, I believe.
*Thread Reply:* for the `select *` use-case?
Release vote is here when you get a moment: https://openlineage.slack.com/archives/C01CK9T7HKR/p1701722066253149
Should we disable `openlineage-airflow` on Airflow 2.8 to force people to use the provider?
*Thread Reply:* it sounds like maybe something about this should be included in the 2.8 docs. The dev rel team is talking about the marketing around 2.8 right now…
*Thread Reply:* also, the release will evidently be out next Thursday
*Thread Reply:* I mean, `openlineage-airflow` is not part of Airflow
*Thread Reply:* We'd have provider for 2.8
*Thread Reply:* so maybe the airflow newsletter would be better
*Thread Reply:* is there anything about the provider that should be in the 2.8 marketing?
*Thread Reply:* I don't think so
*Thread Reply:* Kenten wants to mention that it will be turned off in the 2.8 docs, so please lmk if anything about this changes
https://github.com/OpenLineage/docs/pull/263
Changelog PR for 1.6.0: https://github.com/OpenLineage/OpenLineage/pull/2298
*Thread Reply:* that's weird that ruff-lint found issues, especially when it has the ruff version pinned
*Thread Reply:* `CHANGELOG.md:10: acccording ==> according`
this change is accurate though 🙂
*Thread Reply:* I tried to sneak in a fix in dev but the linter didn’t like it so I changed it back. All set now
*Thread Reply:* The release is in progress
*Thread Reply:* ah, gotcha
```
dev/get_changes.py:49:17: E722 Do not use bare `except`
dev/get_changes.py:49:17: S112 `try`-`except`-`continue` detected, consider logging the exception
```
for next time just add `except Exception:` instead of `except:` 🙂
*Thread Reply:* GTK, thank you
The `release-integration-flink` job failed with this error message:
```
Execution failed for task ':examples:stateful:compileJava'.
> Could not resolve all files for configuration ':examples:stateful:compileClasspath'.
   > Could not find io.**********************:**********************_java:1.6.0-SNAPSHOT.
     Required by:
         project :examples:stateful
```
*Thread Reply:*
```
No cache is found for key: v1-release-client-java--rOhZzScpK7x+jzwfqkQVwOVgqXO91M7VEEtzYHNvSmY=
Found a cache from build 155811 at v1-release-client-java-
```
is this standard behaviour?
*Thread Reply:* well, same happened for 1.5.0 and it worked
*Thread Reply:* we gotta wait for Maciej/Pawel :<
*Thread Reply:* Looks like Gradle version got bumped and gives some problems
*Thread Reply:* Think we can release by midday tomorrow?
*Thread Reply:* oh forgot about this totally
Feedback sought on a redesign of the ecosystem page that (hopefully) freshens and modernizes the page: https://github.com/OpenLineage/docs/pull/258
Changelog PR for 1.6.1: https://github.com/OpenLineage/OpenLineage/pull/2301
*Thread Reply:* @Maciej Obuchowski the flink job failed again
*Thread Reply:* well, at least it's a different error
*Thread Reply:* one more try? https://github.com/OpenLineage/OpenLineage/pull/2302 @Michael Robinson
*Thread Reply:* 1.6.2 changelog PR: https://github.com/OpenLineage/OpenLineage/pull/2304
*Thread Reply:* @Maciej Obuchowski 👆
*Thread Reply:* going out for a few hours, so next try would be tomorrow if it fails again...
*Thread Reply:* Thanks, Maciej. That worked, and 1.6.2 is out.
Starting a thread for collaboration on the community meeting next week
*Thread Reply:* Releases: 1.6.2
*Thread Reply:* 2023 recap/“best-of”?
*Thread Reply:* @Harel Shein any thoughts? Also, does anyone know if Julien will be back from vacation?
*Thread Reply:* We should probably try to do something with the Google proposal
*Thread Reply:* Not sure if it needs additional discussion, maybe just implementation?
*Thread Reply:* I can ask him, but it would probably be good if you could facilitate next week @Michael Robinson?
*Thread Reply:* I agree that we need to address those Google proposals, we should ask Jens if he’s up for presenting and discussing them first?
*Thread Reply:* maybe Pawel wants to present progress with https://github.com/OpenLineage/OpenLineage/issues/2162?
*Thread Reply:* Still waiting on a response from Jens
*Thread Reply:* I think Jens does not have a lot of time now
*Thread Reply:* Emailed him in case he didn’t see the message
*Thread Reply:* Jens confirmed
*Thread Reply:* He will have to join about 15 minutes late
*Thread Reply:* would love to come but I'm at friend's birthday at that time 😐
*Thread Reply:* I'd love to as well, but I have dinner plans 😕
*Thread Reply:* count me in if not too late
@Paweł Leszczyński mind giving this PR a quick look? https://github.com/MarquezProject/marquez/pull/2700 … it’s a dep on https://github.com/MarquezProject/marquez/pull/2698
*Thread Reply:* thanks @Paweł Leszczyński for the +1 ❤️
@Jakub Dardziński: In Marquez, metrics are exposed via the `/metrics` endpoint using Prometheus (most of the custom metrics defined are here). Oddly enough, the Prometheus roadmap states that they have yet to adopt OpenMetrics! But you can backfill the metrics into Prometheus. So, knowing this, I would move to using Metrics Core by Dropwizard and use an exporter to export metrics to Datadog using metrics-datadog. The one major benefit here is that we can define a framework around defining custom metrics internally within Marquez using core Dropwizard libraries, and then enable the reporter via configuration to emit metrics in `marquez.yml`. For example:
```yaml
metrics:
  frequency: 1 minute  # Default is 1 second.
  reporters:
    - type: datadog
      # ...
```
*Thread Reply:* I tested this actually and it works. The only thing is traces: I found it very poor to just have metrics around function names
*Thread Reply:* I totally agree, although I feel metrics and tracing are two separate things here
*Thread Reply:* I really appreciate your help and advice! 🙂
*Thread Reply:* Of course, happy to chime in here
*Thread Reply:* I’m just happy this is getting some much needed love 😄
*Thread Reply:* Also, it seems like Datadog uses OpenTelemetry:
> Datadog Distributed Tracing allows you to easily ingest traces via the Datadog libraries and agent or via OpenTelemetry
And it looks like OpenTelemetry has support for Dropwizard
*Thread Reply:* yep, that's why I liked otel idea
*Thread Reply:* Also, here are the docs for DD + OpenTelemetry … so enabling OpenTelemetry in Marquez would be doable
*Thread Reply:* and we can make all of this configurable via `marquez.yml`
*Thread Reply:* hit me up with any questions! (just know, there will be a delay)
*Thread Reply:* > and we can make all of this configurable via `marquez.yml`
it ain't that easy - we would need to build an extended jar with the OTEL agent, which I think is way too much work compared to the benefits. you can still configure via env vars or system properties
I’ve been looking into partitioning for psql, think there’s potential here for huge perf gains. Anyone have experience?
*Thread Reply:* partition ranges will give a boost by default
*Thread Reply:* Which tables do you want to partition? Event ones?
*Thread Reply:* • runs
• job_versions
• dataset_versions
• lineage_events
• and all the facets tables
@Paweł Leszczyński • PR 2682 approved with minor comments on stream versioning logic / suggestions ✅ • PR 2654 approved with minor comment (we’ll want to do a follow up analysis on the query perf improvements) ✅
*Thread Reply:* Thanks @Willy Lulciuc. I applied all the recent comments and merged 2654.
There is one discussion left in 2682, which I would like to resolve before merging. I added an extra comment on the implemented approach, and I am open to hearing whether this is the approach we can go with.
@Julien Le Dem @Maciej Obuchowski discussion is about when to create a new job version for a streaming job. No deep dive in the code is required to take part in it. https://github.com/MarquezProject/marquez/pull/2682#discussion_r1425108745
*Thread Reply:* awesome, left my final thoughts 👍
Maybe we should clarify the documentation on adding custom facets at the integration level? Wdyt? https://openlineage.slack.com/archives/C01CK9T7HKR/p1702446541936589?thread_ts=1702033180.635339&channel=C01CK9T7HKR&message_ts=1702446541.936589
Hey, I think it would help some people using the Airflow integration (with Airflow 2.6) if we released a patch version of the OL package with PR #2305 included. I am not sure what the release cycle is here, but maybe there is already an ETA for the next patch release? If so, please let me know 🙂 Thanks!
*Thread Reply:* you gotta ask for the release in #general, 3 votes from committers approve immediate release 🙂
*Thread Reply:* @Michael Robinson 3 votes are in 🙂
*Thread Reply:* https://openlineage.slack.com/archives/C01CK9T7HKR/p1702474416084989
*Thread Reply:* Thanks for the ping. I replied in #general and will initiate the release as soon as possible.
seems that we don’t output the correct namespace as in the naming doc for Kafka. we output the kafka server/broker URL as namespace (in the Flink integration specifically) https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md#kafka
*Thread Reply:* @Paweł Leszczyński, would you be able to add the `kafka://` prefix to the Kafka visitors in the flink integration tomorrow?
*Thread Reply:* I am happy to do this. Just to make sure: the docs are correct, and the Flink implementation is missing the `kafka://` prefix, right?
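For reference, my reading of the convention in the naming doc, with a made-up broker and topic:

```python
# The namespace carries the kafka:// scheme plus the bootstrap server;
# the dataset name is the topic. Both values here are hypothetical.
namespace = "kafka://prod-broker1.example.com:9092"
name = "orders"
```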
*Thread Reply:* Thanks @Paweł Leszczyński. made a couple of suggestions, but we can def merge without
*Thread Reply:* would love to discuss this first. If a user stores an Iceberg table in S3, then should it conform to S3 naming or Iceberg naming?
it's the S3 location which defines a dataset. Iceberg is a format for accessing data, but not an identifier as such.
*Thread Reply:* No rush, just something we noticed and that some people in the community are implementing their own patch for it.
my goal for next year is to have a programmatic way of using the naming convention
*Thread Reply:* nope but would be worth reaching out to them to see how we could collaborate? they’re part of the LFAI (sandbox): https://github.com/bitol-io/open-data-contract-standard
*Thread Reply:* background https://medium.com/profitoptics/data-contract-101-568a9adbf9a9
*Thread Reply:* We should still have a conversation:)
If you want to join the conversation on Ray.io integration: https://join.slack.com/t/ray-distributed/shared_invite/zt-2635sz8uo-VW076XU6bKMEiFPCJWr65Q
*Thread Reply:* is there any specific channel/conversation?
*Thread Reply:* Yeah, but it’s private. Added you. For everyone else, Ping me on slack when you join and I’ll add you.
A vote to release Marquez 0.43.0 is open. We need one more: https://marquezproject.slack.com/archives/C01E8MQGJP7/p1702657403267769
*Thread Reply:* the changelog PR is RFR
AWS is making moves! https://github.com/aws-samples/aws-mwaa-openlineage
*Thread Reply:* the repo itself is pretty old, last updated 2mo ago and used OL package not provider (1.4.1)
*Thread Reply:* still it's nice they're doing this :)
*Thread Reply:* since they're using MWAA they won't be affected by the turn-off coming with Airflow 2.8 for a while. Otherwise that would be a good excuse to get in touch with them
*Thread Reply:* I think this repo was related to the blog which was authored a while back - https://aws.amazon.com/blogs/big-data/automate-data-lineage-on-amazon-mwaa-with-openlineage/ No other moves from our end so far, at least from the MWAA team :)
*Thread Reply:* Hi All, I am one of the owners of this repo and working to update this to work with MWAA 2.8.1, with apache-airflow-providers-openlineage==1.4.0. I am facing an issue with my set-up. I am using Redshift SQL as a sample use-case for this and getting an error relating to the Default Extractor. Haven't really looked at this in much detail yet but wondering if you have thoughts? I just updated the env variables to use `AIRFLOW__OPENLINEAGE__TRANSPORT` and `AIRFLOW__OPENLINEAGE__NAMESPACE` and changed the operator from PostgresOperator to SQLExecuteQueryOperator.
```
[2024-03-07 03:52:55,496] Failed to extract metadata using found extractor <airflow.providers.openlineage.extractors.base.DefaultExtractor object at 0x7fc4aa1e3950> - section/key [openlineage/disabled_for_operators] not found in config task_type=SQLExecuteQueryOperator airflow_dag_id=rs_source_to_staging task_id=task_insert_event_data airflow_run_id=manual__2024-03-07T03:52:11.634313+00:00
[2024-03-07 03:52:55,498] section/key [openlineage/config_path] not found in config
[2024-03-07 03:52:55,498] section/key [openlineage/config_path] not found in config
[2024-03-07 03:52:55,499] Executing:
insert into event
SELECT eventid, venueid, catid, dateid, eventname, starttime::TIMESTAMP
FROM s3_datalake.event;
```
*Thread Reply:* @Paul Wilson Villena It looks like a small mistake in OL that I'll fix in the next version - we missed adding a fallback there, and getting the Airflow configuration raises an error when `disabled_for_operators` is not defined in the airflow.cfg file / the env variable. For now it should help to simply add the `[openlineage]` section (see the configuration reference: https://airflow.apache.org/docs/apache-airflow-providers-openlineage/1.4.0/configurations-ref.html#id1) to `airflow.cfg` and set `disabled_for_operators=""`, or just export `AIRFLOW__OPENLINEAGE__DISABLED_FOR_OPERATORS=""`
*Thread Reply:* Will be released in the next provider version: https://github.com/apache/airflow/pull/37994
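The gist of the fix, as a sketch (the real change is in the linked PR; `conf.get` with a `fallback` kwarg is standard Airflow config API):

```python
from airflow.configuration import conf

# With a fallback, a missing [openlineage] key no longer raises
# "section/key ... not found in config".
disabled = conf.get("openlineage", "disabled_for_operators", fallback="")
```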
*Thread Reply:* Hi @Kacper Muda it seems I need to also set this, otherwise the error `section/key [openlineage/config_path] not found in config` persists:
```python
os.environ["AIRFLOW__OPENLINEAGE__CONFIG_PATH"] = ""
```
*Thread Reply:* Yes, sorry for missing that. I fixed it in the code and forgot to mention it. If you were to not use `AIRFLOW__OPENLINEAGE__TRANSPORT`, you'd have to set it to an empty string as well, as it's missing the same fallback 🙂
*Thread Reply:* @Paul Wilson Villena FYI, `apache-airflow-providers-openlineage==1.7.0` has just been released, containing the fix to that problem 🙂
The release is finished. Slack post, etc., coming soon
have we thought of making the SQL parser pluggable?
*Thread Reply:* what do you mean by that?
*Thread Reply:* (this is coming from Apple) like, what if a user wanted to provide their own parser for SQL in place of the one shipped with our integrations
*Thread Reply:* for example, if/when we integrate with DataHub, can they use their parser instead of the one provided
*Thread Reply:* that would be difficult, we would need strong motivation for that 🫥
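If it were ever pursued, one plausible shape, purely hypothetical (nothing like this exists in the integrations today), would be resolving the parser through an entry point instead of a hard-coded import (requires Python 3.10+ for the `group=` kwarg):

```python
from importlib.metadata import entry_points
from typing import Protocol


class SqlParser(Protocol):
    def parse(self, sql: str, dialect: str | None = None) -> object: ...


def load_sql_parser(group: str = "openlineage.sql_parser") -> SqlParser:
    # The entry-point group name is made up for this sketch; a registered
    # third-party parser (e.g. DataHub's) would win over the built-in one.
    eps = list(entry_points(group=group))
    if not eps:
        raise LookupError("no pluggable SQL parser registered")
    return eps[0].load()
```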
This question fits with what we said we would try to document more, can someone help them out with it this week? https://openlineage.slack.com/archives/C063PLL312R/p1702683569726449
Airflow 2.8 has been released. Are we still “turning off” the external Airflow integration with this one? What do Airflow users need to know to avoid unpleasant surprises? Kenten is open to including a note in the 2.8 blog post.
*Thread Reply:* As a newcomer here, I believe it would be wise to avoid supporting Airflow 2.8+ in the `openlineage-airflow` package. This approach would encourage users to transition to the provider package. It's important to clearly communicate that ongoing development and enhancements will be focused on the `apache-airflow-providers-openlineage` package, while `openlineage-airflow` will primarily be updated for bug fixes. I'll look into whether this strategy is already noted in the documentation. If not, I will propose a documentation update.
*Thread Reply:* https://github.com/OpenLineage/OpenLineage/pull/2330 Please let me know if any changes are required; I was not sure how to properly implement it.
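For the general idea, a guard of this shape would do it, as a sketch (not necessarily what the PR implements):

```python
from packaging.version import Version

from airflow.version import version as AIRFLOW_VERSION

# Refuse to activate the legacy integration on Airflow 2.8+, pointing
# users at the provider package instead.
if Version(AIRFLOW_VERSION) >= Version("2.8.0"):
    raise RuntimeError(
        "openlineage-airflow does not support Airflow 2.8+; "
        "use apache-airflow-providers-openlineage instead."
    )
```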
This looks cool, might be useful for us? https://github.com/aklivity/zilla
*Thread Reply:* the license is a bit weird, but should be ok for us. it’s apache, unless you directly compete with the company that built it.
*Thread Reply:* tbh not sure how
*Thread Reply:* I think we should be focused on 1) being compatible with most popular solutions (kafka...) 2) being easy to integrate with (custom transports)
rather than forcing our opinionated way on how OpenLineage events should flow in customer architecture
Apologies for having to miss today’s committer sync — I’ll be picking up my daughter from school
WDYT about starting to add integration specific channels and adding a little welcome bot for people when they join?
*Thread Reply:* the `-dev` and `-users` split seems like overkill, but I also understand that we may want to split user questions from development
*Thread Reply:* maybe just shorten to `spark-integration`, `flink-integration`, etc. Or `integrations-spark`, etc.
*Thread Reply:* we probably should consider a `development` and a `welcome` channel
*Thread Reply:* yeah.. makes sense to me. let’s leave this thread open for a few days so more people can chime in and then I’ll make a proposal based on that.
*Thread Reply:* makes sense to me
*Thread Reply:* I think there is not enough discussion for it to make sense
*Thread Reply:* and empty channels do not invite to discussion
*Thread Reply:* Maybe worth it for spark questions alone? And then for equal coverage we need the others. It’s getting easy to overlook questions in general due to the increased volume and long code snippets, IMO.
*Thread Reply:* yeah I think the volume is still quite low
*Thread Reply:* something like Airflow's #troubleshooting channel easily has order of magnitude more messages
*Thread Reply:* and even then, I'd split between something like #troubleshooting and #development rather than between integrations
*Thread Reply:* not only because it's too granular, but also there's a development that isn't strictly integration related or touches multiple ones
Link to the vote to release the hot fix in Marquez: https://marquezproject.slack.com/archives/C01E8MQGJP7/p1703101476368589
For the newsletter this time around, I’m thinking that a year-end review issue might be nice in mid-January when folks are back from vacation. And then a “double issue” at the end of January with the usual updates. We’ve still got a rather, um, “select” readership, so the stakes are low. If you have an opinion, please lmk.
*Thread Reply:* I’m for mid-January option
1.7.0 changelog PR needs a review: https://github.com/OpenLineage/OpenLineage/pull/2331
Notice for the release notes (WDYT?):
COMPATIBILITY NOTICE
Starting in 1.7.0, the Airflow integration will no longer support Airflow versions >=2.8.0.
Please use the OpenLineage Airflow Provider instead.
It includes a link to here: https://airflow.apache.org/docs/apache-airflow-providers-openlineage/stable/index.html
https://eu.communityovercode.org/ is that a proper conference to talk about OL?
*Thread Reply:* Yes! At least for a broader audience
Happy new year all!
Am I the only one that sees `Free trial in progress` here in the OL slack?
*Thread Reply:* same for me, I think Slack initiated that
*Thread Reply:* we’re on the Slack Pro trial (we were on the free plan before)
*Thread Reply:* I think Slack initiated it
https://github.com/OpenLineage/OpenLineage/issues/2349 - this issue is really interesting. I am hoping to see follow-up from David.
So who wants to speak at our meetup with Confluent in London on Jan. 31st?
*Thread Reply:* do we have sponsorship to fly over?
*Thread Reply:* Not currently but I can look into it
*Thread Reply:* do we have other active community members based in the UK?
*Thread Reply:* I’ll ask around
*Thread Reply:* Abdallah Terrab at Decathlon has volunteered
*Thread Reply:* does it mean we're still looking for someone?
*Thread Reply:* I already said last month that I'll go, but I'm not sure if it's still needed
*Thread Reply:* and does Confluent have a talk there?
*Thread Reply:* Sorry about that, Maciej. I’ll ask Viraj if Astronomer would cover your ticket. There will be a Confluent speaker.
*Thread Reply:* if we need to choose between kafka summit and a meetup - I think we should go for kafka summit 🙂
*Thread Reply:* I think so too
*Thread Reply:* Viraj has requested approval for summit, and we can expect to hear from finance soon
*Thread Reply:* Also, question from him about the talk: what does “streaming” refer to in the title — Kafka only?
*Thread Reply:* Kafka, Spark, Flink
*Thread Reply:* If someone wants to do a joint talk let me know 😉
*Thread Reply:* @Willy Lulciuc will you be in UK then?
*Thread Reply:* I can be, if astronomer approves? but also realizing it’s Jan 31st, so a bit tight
*Thread Reply:* yeah, that sounds… unlikely 🙂
*Thread Reply:* also thinking about all the traveling ive been doing recently and things I need to work on. would be great to have some focus time
did anyone submit a talk to https://www.databricks.com/dataaisummit/call-for-presentations/?
*Thread Reply:* tagging @Julien Le Dem on this one. since the deadline is tomorrow.
*Thread Reply:* I don’t have my computer with me. ⛷️
*Thread Reply:* Does @Willy Lulciuc want to submit? Happy to be co-speaker (if you want. But not necessary)
*Thread Reply:* Willy is also on vacation, I’m happy to submit for the both of us. I’ll try to get something out today
*Thread Reply:* would be great to have someting on Databricks conference 🙂
*Thread Reply:* what I’m currently thinking. learnings from openlineage adoption in Airflow and Flink, and what can be learned / applied on Spark lineage.
This month’s TSC meeting is next Thursday. Anyone have any items to add to the agenda?
*Thread Reply:* @Kacper Muda would you want to talk about doc changes in Airflow provider maybe?
*Thread Reply:* no pressure if it's too late for you 🙂
*Thread Reply:* It's fine, I could probably mention something about it - the problem is that I have a standing commitment every Thursday from 5:30 to 7:30 PM (Polish time, GMT+1) which means I'm unable to participate. 😞
*Thread Reply:* @Kacper Muda would you be open to recording something? We could play it during the meeting. Something to consider if you’d like to participate but the time doesn’t work.
*Thread Reply:* Let me see how much time I'll have during the weekend and I'll come back to you 🙂
*Thread Reply:* Sorry, I got sick and won't be able to do it. Maybe i'll try to make it personally to the next meeting, then the docs should already be released 🙂
*Thread Reply:* @Julien Le Dem are there updates on open proposals that you would like to cover?
*Thread Reply:* @Paweł Leszczyński as you authored about half of the changes in 1.7.0, would you be willing to talk about the Flink fixes? No slides necessary
*Thread Reply:* @Michael Robinson no updates from me
*Thread Reply:* sry @Michael Robinson, I won't be able to join this time. My changes in 1.7.0 are rather small fixes. Perhaps someone else can present them shortly.
I saw some weird behavior with `openlineage-airflow` where it does not respect the transport config for the client, even when setting `OPENLINEAGE_CONFIG` to point to a config file.
the workaround is that if you set the `OPENLINEAGE_URL` env var, it will pick that up and read the config.
this bug doesn't seem to exist in the airflow provider since the loading method is completely different.
*Thread Reply:* would you mind creating an issue?
*Thread Reply:* will do. let me repro on the latest version of `openlineage-airflow` and see if I can repro on the provider
*Thread Reply:* config is definitely more complex than it needs to be...
*Thread Reply:* hmm.. so in 1.7.0, if you define `OPENLINEAGE_URL` then it completely ignores whatever is in the `OPENLINEAGE_CONFIG` yaml
*Thread Reply:* if you don't define `OPENLINEAGE_URL`, and you do define `OPENLINEAGE_CONFIG`: then openlineage is disabled 😂
*Thread Reply:* are you sure `OPENLINEAGE_CONFIG` points to something valid?
*Thread Reply:* ~look at the code flow for create:~ ~the default factory doesn’t supply config, so it tried to set http config from env vars. if that doesn’t work, it just returns the console transport~
*Thread Reply:* oh, no nvm. it’s a different flow.
*Thread Reply:* but yes, `OPENLINEAGE_CONFIG` works for sure
*Thread Reply:* on 0.21.1, it was working when `OPENLINEAGE_URL` was supplied
*Thread Reply:*
```python
transport_config = None if "transport" not in self.config else self.config["transport"]
self.transport = factory.create(transport_config)
```
this `self.config` actually looks at
```python
@property
def config(self) -> dict[str, Any]:
    if self._config is None:
        self._config = load_config()
    return self._config
```
which then uses this:
```python
def load_config() -> dict[str, Any]:
    file = _find_yaml()
    if file:
        try:
            with open(file) as f:
                config: dict[str, Any] = yaml.safe_load(f)
                return config
        except Exception:  # noqa: BLE001, S110
            # Just move to read env vars
            pass
    return defaultdict(dict)


def _find_yaml() -> str | None:
    # Check OPENLINEAGE_CONFIG env variable
    path = os.getenv("OPENLINEAGE_CONFIG", None)
    try:
        if path and os.path.isfile(path) and os.access(path, os.R_OK):
            return path
    except Exception:  # noqa: BLE001
        if path:
            log.exception("Couldn't read file %s: ", path)
        else:
            pass  # We can get different errors depending on system

    # Check current working directory:
    try:
        cwd = os.getcwd()
        if "openlineage.yml" in os.listdir(cwd):
            return os.path.join(cwd, "openlineage.yml")
    except Exception:  # noqa: BLE001, S110
        pass  # We can get different errors depending on system

    # Check $HOME/.openlineage dir
    try:
        path = os.path.expanduser("~/.openlineage")
        if "openlineage.yml" in os.listdir(path):
            return os.path.join(path, "openlineage.yml")
    except Exception:  # noqa: BLE001, S110
        # We can get different errors depending on system
        pass
    return None
```
*Thread Reply:* oh I think I see
*Thread Reply:* so this isn't passed if you have a config but there is no `transport` field in this config:
```python
transport_config = None if "transport" not in self.config else self.config["transport"]
self.transport = factory.create(transport_config)
```
*Thread Reply:* here's the config I'm using:
```yaml
transport:
  type: "kafka"
  config:
    bootstrap.servers: "kafka1,kafka2"
    security.protocol: "SSL"

    # CA certificate file for verifying the broker's certificate.
    ssl.ca.location=ca-cert
    # Client's certificate
    ssl.certificate.location=client_?????_client.pem
    # Client's key
    ssl.key.location=client_?????_client.key
    # Key password, if any.
    ssl.key.password=abcdefgh

  topic: "SOF0002248-afaas-lineage-DEV-airflow-lineage"
  flush: True
```
*Thread Reply:* it should load, but fail when actually trying to emit to kafka
*Thread Reply:* but it should still init the transport
*Thread Reply:* I'm testing on this image:
```dockerfile
FROM quay.io/astronomer/astro-runtime:6.4.0

COPY openlineage.yml /usr/local/airflow/
ENV OPENLINEAGE_CONFIG=/usr/local/airflow/openlineage.yml
ENV AIRFLOW__CORE__LOGGING_LEVEL=DEBUG
```
*Thread Reply:* are you sure there are no permission errors?
*Thread Reply:*
```python
def _find_yaml() -> str | None:
    # Check OPENLINEAGE_CONFIG env variable
    path = os.getenv("OPENLINEAGE_CONFIG", None)
    try:
        if path and os.path.isfile(path) and os.access(path, os.R_OK):
            return path
    except Exception:  # noqa: BLE001
        if path:
            log.exception("Couldn't read file %s: ", path)
        else:
            pass  # We can get different errors depending on system
```
it checks stuff like `os.access(path, os.R_OK)`
*Thread Reply:* I'm sure, because if I uncomment `ENV OPENLINEAGE_URL=http://foo.bar/` on 0.21.1, it works
*Thread Reply:* ah, I can add a permissive chmod to the dockerfile to see if it helps
*Thread Reply:* but I’m also not seeing any log/exception in the task logs
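Worth noting, given the `load_config()` shown earlier swallows parse errors with `except Exception: pass`: an unparsable YAML file is indistinguishable from no config at all, and nothing gets logged. A quick way to check (path taken from the Dockerfile above):

```python
import yaml

# If this raises, load_config() would have silently ignored the file.
with open("/usr/local/airflow/openlineage.yml") as f:
    print(yaml.safe_load(f))
```

(The `key=value` lines in the pasted config are librdkafka property-file syntax, not YAML mappings, so `safe_load` may well choke on them.)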
*Thread Reply:* will look at this later if you won't find solution 🙂
*Thread Reply:* one more thing, can you try with just
```yaml
transport:
  type: console
```
*Thread Reply:* but yeah, there’s something not great about separation of concerns between client config and adapter config
*Thread Reply:* the adapter should not care, unless you're using `MARQUEZ_URL`... which is backwards compatibility from when it was still the marquez airflow integration
I'm starting to put together the year-in-review issue of the newsletter and wonder if anyone has thoughts on the "big stories" of 2023 in OpenLineage. So far I've got:
• Launched the Airflow Provider
• Added static (AKA design) lineage
• Welcomed new ecosystem partners (Google, Metaphor, Grai, Datahub)
• Started meeting up and held events with Metaphor, Google, Collibra, etc.
• Graduated from the LFAI
What am I missing? Wondering in particular about features. Is Iceberg support in Flink a "big" enough story? Custom transport types? SQL parser improvements?
Blog post content about contributing to OpenLineage-Spark and code internals. The content comes from the November meetup at Google, and I split it into two posts: https://docs.google.com/document/d/1Hu6clFckse1J_M1w2MMaTTJS0wUihtFsxbDQchtTVtA/edit?usp=sharing
Does anyone remember why `execution_date` was chosen as part of the `run_id` for an Airflow task, instead of, for example, `start_date`?
Due to this decision, we can encounter a duplicate `run_id` if we delete the DagRun from the database, because the `execution_date` remains the same. If I run a backfill job for yesterday, then delete it and run it again, I get the same ids. I'm trying to understand the rationale behind this choice so we can determine whether it's a bug or a feature. 😉
*Thread Reply:* `start_date` is unreliable AFAIK, there can be no start date sometimes
*Thread Reply:* this might be true for only some versions of Airflow
*Thread Reply:* Also here, where they define a combination of some deterministic attributes, `execution_date` is used and not `start_date`, so there might be something to it. That still leaves us with the behaviour I described.
*Thread Reply:* > Due to this decision, we can encounter a duplicate `run_id` if we delete the DagRun from the database, because the `execution_date` remains the same.
Hmm, so given that the OL runID uses the same params Airflow uses to generate the hash, this seems more like a limitation. A better question would be: if a user runs, deletes, then runs the same DAG again, is that an expected scenario we should handle? tl;dr yes, but Airflow hasn't felt it important enough to address.
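An illustrative sketch of the mechanics (the deterministic hashing is the point; this is not the integration's exact formula):

```python
import uuid


def run_id(dag_id: str, task_id: str, execution_date: str) -> str:
    # Derived purely from fields that survive deleting the DagRun, so a
    # delete-and-rerun of the same execution_date reproduces the same id.
    return str(uuid.uuid3(uuid.NAMESPACE_URL, f"{dag_id}.{task_id}.{execution_date}"))


assert run_id("dag", "task", "2024-01-01T00:00:00") == run_id("dag", "task", "2024-01-01T00:00:00")
```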
*Thread Reply:* not concurrent with Databricks conference this year? 😂
*Thread Reply:* nope, a week before so that everyone goes there and gets sick and can’t attend the databricks conf on the following week
*Thread Reply:* outstanding business move again
*Thread Reply:* @Willy Lulciuc wanna submit to this?
*Thread Reply:* yeah id love to: maybe something like “Detecting Snowflake table schema changes with OpenLineage events” + use cases + using lineage to detect impact?
*Thread Reply:* yeah! that sounds like a fun talk!
*Thread Reply:* idk if @Julien Le Dem will already be in 🇫🇷 that week? but I’d be happy to co-present if not.
*Thread Reply:* I don’t know when I’m flying out yet but it will be in the middle of that time frame.
*Thread Reply:* +1 on Harel co-presenting :)
*Thread Reply:* School last day is the 4th. I need to be in France (not jet lagged) before the 8th
*Thread Reply:* ok, @Harel Shein I’ll work on getting a rough draft ready before the deadline (added to my TODO of tasks)
hey, I'm not feeling well, will probably skip today meeting
Added comment to discussion about Spark parent job issue: https://github.com/OpenLineage/OpenLineage/issues/1672#issuecomment-1883524216 I think we have the consensus so I'll work on it.
*Thread Reply:* @Maciej Obuchowski should we give an update on that at the TSC meeting tomorrow?
Another issue: do you think we should somehow handle the partitioning in OpenLineage standard and integrations? I would think of a situation where we somehow know how dataset is partitioned - not think about how to automagically detect that. Some example situations:
*Thread Reply:* interesting. did you hear anyone with this usecase/requirement?
Hey, there's a Windows user getting this error when trying to run Marquez: `org-apache-tomcat-jdbc-pool-connectionpool-unable-to-create-initial-connection`. Is it a driver issue? I'll try to get more details and reproduce it, but if you know what this probably is related to, please lmk
*Thread Reply:* do we even support Windows? 😱
*Thread Reply:* Here's more info about the use case:
> Thanks Michael, this is really helpful. I am working on a project where I need to run Marquez and OpenLineage on top of Airflow DAGs which run dbt commands internally through BashOperator. I need to present to my org whether we will benefit from bringing in Marquez metadata lineage.
> so I was taking this approach of setting up Marquez first, then will explore how it integrates with Airflow using BashOperator
*Thread Reply:* We don’t support this operator, right? What kind of graph can they expect?
*Thread Reply:* They can use dbt integration directly maybe?
*Thread Reply:* We now have a former astronomer as engineering director at DataHub
*Thread Reply:* https://www.linkedin.com/in/samantha-clark-engineer/
taking on letting users run integration tests
*Thread Reply:* so there are two issues I think
*Thread Reply:* in Airflow workflows there are:
```yaml
filters:
  branches:
    ignore: /pull\/[0-9]+/
```
which are set only for PRs from forked repos
*Thread Reply:* in Spark there are tests that strictly require env vars (that contain credentials to various external systems like Databricks or BigQuery). if there are no such env vars, the tests fail, which is confusing for new committers
*Thread Reply:* the first behaviour is silent - which I think is bad because it's easy to skip integration tests; the build is green (but should it be? we don't know, the integration tests didn't run, and someone needs to know and remember that before approving and merging)
*Thread Reply:* the second is misleading because it hints there's something wrong in the code while there doesn't necessarily need to be. on the other hand, you shouldn't approve and merge a failing build, so you see there's some action required
*Thread Reply:* reg. action required: for now we've been running a manual step (using https://github.com/jklukas/git-push-fork-to-upstream-branch) which is a workaround, but it's not straightforward and requires manual work. it also doesn't solve the two issues I mentioned above
*Thread Reply:* what I propose is to simply add an approval step before integration tests: https://circleci.com/docs/workflows/#holding-a-workflow-for-a-manual-approval
it's a circleCI-only thing, so you need to log into circleCI, check if there's any pending task to approve, and then approve or not
*Thread Reply:* it doesn't allow for much configuration, but I think it would work. you also can't integrate it in any way with the GitHub UI (e.g. there's no option to click something in the PR's UI to approve)
*Thread Reply:* but that would let project maintainers manage when the code is safe to run, and it's still visible and (I think) readable for everyone
*Thread Reply:* the only thing I'm not sure about is who can approve:
> Anyone who has push access to the repository can click the **Approval** button to continue the workflow
but I'm not sure to which repo. if someone runs on a fork and has push access to the fork - can they approve? it wouldn't make sense..
*Thread Reply:* https://circleci.com/blog/access-control-cicd/
that's the best I could find from circleCI on this subject
*Thread Reply:* so I think the best solution would be to enable "Pass secrets to builds from forked pull requests" (requires careful review of the CI process)
*Thread Reply:* contexts give the possibility to let users run e.g. unit tests in CI without exposing credentials
*Thread Reply:* this approach makes sense to me, assuming the permission model is how you outlined it
*Thread Reply:* one thing to add and test: the approval step could have a condition to always run if it's not from a fork. not sure if that's possible
*Thread Reply:* that sounds like it should be doable within what's available in Circle. GHA can definitely do that
*Thread Reply:* GHA can do things that circleCI can’t 😂
*Thread Reply:* > approval steps could have a condition to always run if it's not from a fork. not sure if that's possible
ffs, it's not that easy to set up
*Thread Reply:* whenever I touch circleCI base changes I feel like a magician. yet, here goes the PR with the magic (just look at the side effect, it bothered me for so long 🪄) 🚨 🚨 🚨 https://github.com/OpenLineage/OpenLineage/pull/2374
*Thread Reply:* I'm assuming the reason for the speedup in `determine_changed_modules` is that we don't go install `yq` every time this runs?
What are your nominees/candidates for the most important releases of 2023? I’ll start (feel free to disagree with my choices, btw): • 1.0.0 • 1.7.0 (disabled the external Airflow integration for 2.8+) • 0.26.0 (Fluentd) • 0.19.2 (column lineage in SQL parser) • …
*Thread Reply:* • 1.7.0 (disabled the external Airflow integration for 2.8+) that doesn’t sound like one of the most important
*Thread Reply:* 1.0.0 had this: https://openlineage.io/docs/releases/1_0_0 which actually fixed the spec to match the JSON schema spec
@Gowthaman Chinnathambi has joined the channel
1.8 changelog PR: https://github.com/OpenLineage/OpenLineage/pull/2380
FYI: I'm trying to set up our meetings with the new LFAI tooling, you may get some emails. you can ignore for now.
@Krishnaraj Raol has joined the channel
@Harel Shein @Julien Le Dem @tati Python client releases on both Marquez and OpenLineage are failing because PyPI no longer supports password authentication. We need to configure the projects for Trusted Publishers or use an API token. I've looked and can't find OpenLineage credentials for PyPI, but if I had them we'd still need to set up 2FA in order to make the change. How should we proceed here? Should we sync to sort this out? Thanks (btw I reached out to Willy separately when this came up during a Marquez release attempt last week)
*Thread Reply:* ah! I can look into it now.
*Thread Reply:* I'm failing to find the credentials for PyPI anywhere.
*Thread Reply:* @Maciej Obuchowski any ideas? (git blame shows you wrote that part, and @Michael Collado did some circleCI setup at some point)
*Thread Reply:* I just sent a reset password email to whoever registered for the openlineage
user
*Thread Reply:* Thanks @Harel Shein. Can confirm I didn’t get it
*Thread Reply:* The password should be in the CircleCI context right?
*Thread Reply:* yes, it's there. I was trying to avoid writing a job that prints it out
*Thread Reply:* that will be the fallback if no one responds 🙂
*Thread Reply:* alright, I've setup 2FA and added a few more emails to the PyPI account as fallback.
*Thread Reply:* unfortunately, there's only one Trusted Publisher option for PyPI, which is GH Actions, so we'll have to use the API token route. PR incoming soon
*Thread Reply:* didn't need to make any changes. I updated the circle context and re-ran the PyPI release - we're back to 🟢
*Thread Reply:* ^ @Michael Robinson FYI
*Thread Reply:* thank you @Harel Shein. Releasing the jars now. Everything looks good
It looks like my laptop bit the dust today so might miss the sync
Do I have to create account to join the meeting?
*Thread Reply:* turns out you can just pass your mail
*Thread Reply:* same on my side
*Thread Reply:* i don't remember if i have one
I did something wrong, meeting link should we good now: https://zoom-lfx.platform.linuxfoundation.org/meeting/91671382513?password=424b74a1-43fa-4d0e-885f-c9b5417cf57b
*Thread Reply:* use same e-mail you got invited to
there's a data warehouse (https://www.firebolt.io/) and a streaming platform (https://memphis.dev/) written in Go. so I guess it's not futile to write a Go client? 🙂
Any potential issues with scheduling a meetup on Tuesday, March 19th in Boston that you know of? The Astronomer all-hands is the preceding week
PR to add 1.8 release notes to the docs needs a review: https://github.com/OpenLineage/docs/pull/274
*Thread Reply:* thanks @Jakub Dardziński
Feedback requested on this draft of the year-in-review issue of the newsletter: https://docs.google.com/document/d/1MJB9ughykq9O8roe2dlav6d8QbHZBV2A0bTkF4w0-jo/edit?usp=sharing. Did you give a talk that isn't in the talks section? Is there an important release that should be in the releases section but isn't? Other feedback? Please share.
Feedback requested on a new page for displaying the ecosystem survey results: https://github.com/OpenLineage/docs/pull/275. The image was created for us by Amp. @Julien Le Dem @Harel Shein @tati
*Thread Reply:* Looks great!
*Thread Reply:* Very cool indeed! I wonder if we should share the raw data as well?
*Thread Reply:* Maybe if you could share it here first @Michael Robinson ?
*Thread Reply:* Yes, planning to include a link to the raw data as well and will share here first
*Thread Reply:* @Harel Shein thanks for the suggestion. Lmk if there's a better way to do this, but here's a link to Google's visualizations: https://docs.google.com/forms/d/1j1SyJH0LoRNwNS1oJy0qfnDn_NPOrQw_fMb7qwouVfU/viewanalytics. And a .csv is attached. Would you use this link on the page or link to a spreadsheet instead?
*Thread Reply:* Going with linking to Google's charts for the raw data for now. LMK if you'd prefer another format, e.g. Google sheet
*Thread Reply:* was just looking at it, looks great @Michael Robinson!
*Thread Reply:* Excellent work, @Michael Robinson! 👏 👏 👏
This looks nice!
*Thread Reply:* @Jakub Dardziński @Maciej Obuchowski is the Airflow Provider stuff still current?
*Thread Reply:* yeah, looks current
@Laurent Paris has joined the channel
Hey, created issue around using Spark metrics to increase operational ease of using Spark integration, feel free to comment: https://github.com/OpenLineage/OpenLineage/issues/2402
@Lohit VijayaRenu has joined the channel
Flyte is offering a 20-25-minute speaking slot at their community meeting on March 6th at 9 am PT. They'd like it to be a general introduction to OpenLineage
*Thread Reply:* I can take it if no one else is interested. I’ll be doing a lot of intro to OL presentations in the next few weeks, so I’ll be very practiced by then :)
@Emili Parreno has joined the channel
@Paweł Leszczyński / @tati I'm expecting at least 5 pictures from the meetup today! 😄
*Thread Reply:* could we also get a signup sheet and headcount please? 😬
Hi, is there any reason not to perform a release today as scheduled? I know we released 1.8 only one week ago, but it's the first of the month and @Damien Hawes's PR #2390 to add support for Scala 2.12 and 2.13 in Spark, along with fixes in the Spark and Flink integrations, are unreleased. Would it make more sense to wait for Damien's PR #2395?
*Thread Reply:* This isn't ready.
*Thread Reply:* I'm working on enabling integration tests for the Scala 2.13 variants.
*Thread Reply:* It will take some time, probably Monday / Tuesday next week is my ETA.
*Thread Reply:* Thanks @Damien Hawes, no pressure. But if early next week is your estimate I think it makes sense to wait. So this is GTK
*Thread Reply:* marquez in the wild! 💯💯🚀. thanks for sharing!
*Thread Reply:* they're doing anomaly detection successfully
*Thread Reply:* eg, deprecated tables still being used or tables written in multiple locations
*Thread Reply:* this needs to become a blog post!
*Thread Reply:* Would love to get the slides if they are willing to share!
*Thread Reply:* They've said yes to a blog post. This presentation gets us closer to starting in earnest. I've asked for the slides. Too bad the Confluent organizer wasn't supportive of recording. Maybe next time
Congratulations to our new committer on the team @Damien Hawes!!
*Thread Reply:* hey @Damien Hawes, now you can push your branches into `origin` and the integration tests are automatically approved 🙂
*Thread Reply:* Thank you for the congratulations @Harel Shein. It is humbling to be nominated and accepted.
@Maciej Obuchowski - haha, thanks!
*Thread Reply:* Congratulations @Damien Hawes! Thank you for all your contributions
https://opensource.googleblog.com/2024/02/announcing-google-season-of-docs-2024.html maybe we should improve our docs? 🙂
Agenda items or discussion topics for Thursday's TSC? @Julien Le Dem @Harel Shein
*Thread Reply:* I can take 5 minutes explaining Spark job hierarchy
*Thread Reply:* Nothing specific on my end
*Thread Reply:* more updates on the Spark side of things? @Paweł Leszczyński / @Damien Hawes may want to talk about the recent additions? we could also discuss the proposals for circuit breakers / metrics?
Anyone have an opinion about creating an OpenLineage "company" rather than a group on LinkedIn? You can get metrics from LinkedIn's API if you have a company rather than a group.
*Thread Reply:* Airflow does it: https://www.linkedin.com/company/apache-airflow/
*Thread Reply:* Spark too https://www.linkedin.com/company/apachespark/
*Thread Reply:* I think that's a good idea
We have agreement from Astronomer to move ahead with the Orbit changes discussed today in the committers sync. So I'll start on the exports asap.
Please follow our new OpenLineage company page on LinkedIn. Evidently, the only way to join the company is to add it to your experience history.
FYI: deadline Feb 25th https://2024.berlinbuzzwords.de/call-for-papers/
*Thread Reply:* @Peter Hicks want to submit a column-level lineage talk?
*Thread Reply:* we'll probably submit something about OpenLineage/Streaming with @Paweł Leszczyński
*Thread Reply:* @Willy Lulciuc want to come to Berlin? 😄
*Thread Reply:* we were thinking with Maciej about some kind of Flink & Streaming around OpenLineage talk, as this can be interesting to the community. I'll prepare an abstract next week
*Thread Reply:* updated the talks project on github
*Thread Reply:* @Maciej Obuchowski i wish, i’d need a sponsor 😅
This open-source community management tool looks interesting as a supplement to Orbit: https://ossinsight.io/analyze/OpenLineage/OpenLineage#overview
*Thread Reply:* you can compare projects side-by-side, which is something Orbit doesn't offer
Metaplane added an Airflow provider to send lineage data to them. It's basically a new connection that extends BaseHook, and users need to proactively send callbacks; not sure why they took that approach. https://www.metaplane.dev/blog/airflow-integration
*Thread Reply:* If you're doing all that work, you could actually just use a dag_policy to add it to all the DAGs automatically (sketch at the end of this thread)
*Thread Reply:* https://docs.metaplane.dev/docs/airflow#dag-and-task-lineage I think that's a fairly... unsophisticated approach?
*Thread Reply:* However, if I was redoing Airflow integration from scratch, I'd seriously rethink using connections instead of OPENLINEAGE_URL
or configuring it the way we did
*Thread Reply:* The plugin could load up custom transport types and generate connection types based out of it
*Thread Reply:* Just curious why they did not use listeners... it's not like it's a new feature now, it has been there for like 5 minor releases
*Thread Reply:* we could still add support for Airflow connections with some sort of deprecation of OPENLINEAGE_URL and the current way
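to make the dag_policy idea above concrete - a minimal sketch, with the lineage callback purely hypothetical, assuming Airflow 2.6+ where callback lists are supported:
```
# Hypothetical sketch of the dag_policy idea: a cluster policy in
# airflow_local_settings.py that attaches a lineage callback to every task,
# so users don't wire anything up per DAG. emit_lineage is made up.
from airflow.models.dag import DAG


def emit_lineage(context):
    # placeholder: build and send an OpenLineage event for the finished task
    ...


def dag_policy(dag: DAG) -> None:
    # runs for every DAG the scheduler parses
    for task in dag.tasks:
        callbacks = task.on_success_callback or []
        if not isinstance(callbacks, list):
            callbacks = [callbacks]
        task.on_success_callback = callbacks + [emit_lineage]
```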
Hey, I was working on a PR updating the docs for Python, Java and Airflow (probably Spark is next), and it hit me that we still have those in two places: the README.md inside the package and the openlineage.io site. Both contain much the same information; sometimes the site has more (e.g. Airflow). Do you think it would be a good idea to just put a redirect to the site in the README.md files for the packages? Maybe add some brief description at the top and then redirect the user to the site? In the long term, maintaining both and keeping them in sync is not an optimal solution imho.
*Thread Reply:* I agree -- in addition to the maintenance burden there's the risk posed by out-of-date/conflicting info for users
*Thread Reply:* I've wanted to remove README.md files as user-facing docs for some time.
However, it might be worth keeping those (maybe under a different name) as purely internal development docs - not related to external APIs, but more like "use this incantation to compile the integration".
*Thread Reply:* I added the redirect here https://github.com/OpenLineage/OpenLineage/pull/2448
There was not much information about the compilation and other internal stuff, so I think those docs first need to be created, and then we can keep them inside the package files under some different name, as Maciej mentioned.
2023 OpenLineage Survey Analysis/takeaways What surprised you or struck you as notable in the 2023 survey data? What would you like to see added, changed or removed in the 2024 version? I need your help to ensure we get the most useful and actionable insights we can from this exercise. I created a doc for sharing opinions/comments, but I'd be happy to discuss it in any forum. Here's the doc, including some initial takeaways, as a starting point: https://docs.google.com/document/d/1aiKtKjcFU0AjS46cow6cbx8EvzV0P2KnGQ4rFD4qKLM/edit?usp=sharing.
@Damien Hawes can we share the Gradle plugins that you've implemented in Spark `buildSrc` with the Flink integration too? It would cut down on boilerplate, but not sure how we can do this (without copying code) 🙂
*Thread Reply:* You have to publish the plugin to the Gradle Plugin repository
*Thread Reply:* Seems like we could publish the plugin to the local dir first: https://docs.gradle.org/current/userguide/plugins.html#sec:custom_plugin_repositories
@Maciej Obuchowski in our OL spec, we require `_producer` and `_schemaURL`, but in our airflow provider we only send the `_producer`. Was this an intentional omission of `_schemaURL`?
ahh i think I was confused given it's marked as `_base_skip_redact` here, but looks like `_schemaURL` is being added… need to verify
*Thread Reply:* schemaURL is always sent, the thing is it's incorrect in many cases AFAIR
*Thread Reply:* yeah… we’re validating events and most (if not all) don’t have that field populated. does the airflow provider not set it?
*Thread Reply:* airflow provider uses facets from the `openlineage-python` package
*Thread Reply:* > thing is it’s incorrect in many cases AFAIR
more curious about this comment. why would this be the case?
*Thread Reply:* tbh I’m not sure. the URL for the base schema is correct only for https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/OpenLineage.json and the facets in it (which is not too many). and for the others it was either just not done or mistakenly repeated with the same pattern?
*Thread Reply:* dunno, looks more like a historical approach that wasn’t adjusted at some point
*Thread Reply:* and.. schemaURL is apparently irrelevant since no one has validated it and reported issues
*Thread Reply:* is there an open issue for this? I feel we should have some guidance here or not require it, but we should decide what we do here
*Thread Reply:* elephant in the room
*Thread Reply:* no, I don’t think there is one
*Thread Reply:* ok, I’ll open one. we’re ingesting events into kafka and were incorrectly applying validation to events based on the spec
*Thread Reply:* @Julien Le Dem ping to address the elephant above 😉 we’re talking about it in this thread
*Thread Reply:* I see. btw generating python facets from json schema would be the best solution, it was too complex so far ;_;
*Thread Reply:* ok, I’ll summarize our discussion in the issue
*Thread Reply:* thanks Willy!
*Thread Reply:* @Willy Lulciuc double-checking - given the example of SQLJobFacet, should the schema URL be set to:
• https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/facets/SQLJobFacet.json
• https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/spec/facets/SQLJobFacet.json#/$defs/SQLJobFacet
• https://openlineage.io/spec/facets/1-0-0/SQLJobFacet.json#/$defs/SQLJobFacet?
or in other words - should it be the same as it is in the Java client? 🙂 which is the last one
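fwiw, a quick way to check what the Python client actually stamps on a facet - a minimal sketch, assuming the pre-codegen `openlineage.client.facet` module:
```
from openlineage.client.facet import SqlJobFacet

# BaseFacet sets _producer and _schemaURL in __attrs_post_init__, so
# inspecting an instance shows exactly what would be emitted.
facet = SqlJobFacet(query="SELECT 1")
print(facet._schemaURL)
```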
Abdallah (Decathlon) made a release request today. https://openlineage.slack.com/archives/C01CK9T7HKR/p1708514231690979
*Thread Reply:* I would ask that if we make a release today, we include the Scala 2.13 support for Spark (and merge the PR for the docs)
*Thread Reply:* I guess we have to decide: is splitting Iceberg off from the main code a reason to hold back the release?
*Thread Reply:* Thoughts @Maciej Obuchowski?
*Thread Reply:* I think we can do a release today/tomorrow, but having functionality removed makes this a much harder choice
*Thread Reply:* What functionality would be removed?
*Thread Reply:* Iceberg support? Or do you propose something else?
*Thread Reply:* Oh, I was under the impression that Iceberg support wasn't going to be removed. Instead the direct dependencies on Iceberg in the core code were being removed, and bundled into their own module, but at the end of the day, the project would still contain classes capable of dealing with Iceberg.
*Thread Reply:* I understood that without https://github.com/OpenLineage/OpenLineage/pull/2437/files there will be no support for Iceberg for 2.13
*Thread Reply:* so I guess we can go and then follow up with next release soon?
*Thread Reply:* There should still be Iceberg support for 2.13
*Thread Reply:* If there wasn't, that would hurt us, and by us, I mean the team I belong to @ Booking and my partner teams.
*Thread Reply:* Basically, if I understand Mattia's direction, we want to be able to say:
OK, OpenLineage has been tested against these versions of Iceberg and found to be working.
*Thread Reply:* Just to confirm, are we waiting for this one https://github.com/OpenLineage/OpenLineage/pull/2446?
*Thread Reply:* We're waiting for comments on this: https://openlineage.slack.com/archives/C01CK9T7HKR/p1708349868363669
*Thread Reply:* No-one has left comments
*Thread Reply:* Which means, at least in my opinion, no-one else has anything to say.
*Thread Reply:* +1. It's been over 48 hrs, so seems safe to go ahead
*Thread Reply:* @Maciej Obuchowski - I'm going to merge that PR, ye?
*Thread Reply:* (First it needs an approval)
*Thread Reply:* Pawel is OOO, I believe
*Thread Reply:* Aye, but I believe @Maciej Obuchowski can approve.
*Thread Reply:* :gh_approved:
*Thread Reply:* Working on the changelog now
*Thread Reply:* oops, forgot about the release vote. please +1
*Thread Reply:* it got merged 👀
*Thread Reply:* amazing feedback on a 10k line PR 😅
*Thread Reply:* maybe they have policy that feedback starts from 10k lines
*Thread Reply:* it wasn’t enough
*Thread Reply:* too big to review, LGTM
*Thread Reply:* sounds like an easy fix? we have time
*Thread Reply:* Easy fix? Yeah - not at all.
*Thread Reply:* Alright, putting the release on hold, then
*Thread Reply:* yeah - we could end up with Spark 2 dependency when using it in Spark 3 context, and that's not good
*Thread Reply:* oh, unless you mean compile time dependency on Spark 2
*Thread Reply:* then no, we need to have it; everything in the `lifecycle` package depends on it 🙂
The idea is that it contains code common to all Spark versions - code that Spark itself mostly does not change - and we have spark2/3/... directories for things that specifically diverge from the baseline
*Thread Reply:* I assume Paweł meant it should not have a Spark dependency, as in it should not depend on a particular Spark version
*Thread Reply:* Correct me if I'm wrong, but it sounds safe to proceed. So here's the changelog PR: https://github.com/OpenLineage/OpenLineage/pull/2452
*Thread Reply:* It's safe
*Thread Reply:* @Michael Robinson let's wait with release till we solve the intermittent test failing issue https://openlineage.slack.com/archives/C065PQ4TL8K/p1708609977907359
Any idea/preference where to put very specific doc information? For example: if you're running Spark in this specific way, do this. Separate doc page seems like overkill, but I'm not sure where to put something like this where it would be discoverable
*Thread Reply:* Maybe the information could go on a new stub about "Special Cases" or something (not a great title but don't know the use case). That way the page isn't just about the exceptional case?
*Thread Reply:* Can we not trust search?
@Michael Robinson do we know when the next release will be going out? I’d sneak in some feedback this week on: • https://github.com/OpenLineage/OpenLineage/pull/2371 • and some of the circuit breaker work if it’s not too late @Paweł Leszczyński @Maciej Obuchowski 😉
*Thread Reply:* it almost happened today, so tomorrow? 🤞
*Thread Reply:* still provide feedback anyway @Willy Lulciuc!
*Thread Reply:* will do! I’ll get some feedback in tmr
Hey, i made a PR that updates the OL Airflow Provider documentation (removing outdated stuff, moving some from current OL docs, adding new info). It's not a small one, but i think it's worth the time. Any feedback is highly appreciated, let me know if something is missing, is not clear or simply wrong 🙂
@Damien Hawes there have been some random test failures after merging the last 2.13 PR, for example https://app.circleci.com/pipelines/github/OpenLineage/OpenLineage/9437/workflows/4eda8d67-3bd1-4527-84fa-88c19e6774bd/jobs/179622
```
> Task :app:copyIntegrationTestFixtures
Too long with no output (exceeded 10m0s): context deadline exceeded
```
hanging on `copyIntegrationTestFixtures`?
*Thread Reply:* Ah, it happened also before latest PR https://app.circleci.com/jobs/github/OpenLineage/OpenLineage/179357
*Thread Reply:* I wonder if the disk is full on that particular executor
*Thread Reply:* It's always failing at the `:app:copyIntegrationTestFixtures` step
*Thread Reply:* Because I'm not able to replicate this on my local
*Thread Reply:* yeah I can't replicate that too
*Thread Reply:* I'm rerunning with SSH on CI, will take a look at disk space
*Thread Reply:* OK. I was literally about to edit the CI to run df -H
*Thread Reply:* circleci@ip-10-0-52-168:~$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/root 146G 13G 133G 9% /
devtmpfs 7.7G 0 7.7G 0% /dev
tmpfs 7.7G 0 7.7G 0% /dev/shm
tmpfs 1.6G 836K 1.6G 1% /run
tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs 7.7G 0 7.7G 0% /sys/fs/cgroup
/dev/nvme0n1p15 105M 6.1M 99M 6% /boot/efi
*Thread Reply:* And if you run `df -H /home/circleci/openlineage/integration/spark`
*Thread Reply:*
```
circleci@ip-10-0-52-168:~$ df -H /home/circleci/openlineage/integration/spark
Filesystem      Size  Used Avail Use% Mounted on
/dev/root       156G   15G  142G  10% /
```
*Thread Reply:* I guess disk usage is increasing, but very slowly?
```
circleci@ip-10-0-52-168:~$ df /home/circleci/openlineage/integration/spark
Filesystem     1K-blocks     Used Available Use% Mounted on
/dev/root      152243760 14273256 137954120  10% /
circleci@ip-10-0-52-168:~$ df /home/circleci/openlineage/integration/spark
Filesystem     1K-blocks     Used Available Use% Mounted on
/dev/root      152243760 14273436 137953940  10% /
```
*Thread Reply:* the max memory seems very small?
*Thread Reply:* let's try it? https://github.com/OpenLineage/OpenLineage/pull/2454
*Thread Reply:* ⬆️ ⬆️ it died without filling the disk
*Thread Reply:* Seems to be running again
*Thread Reply:* This one will hang 😐 https://app.circleci.com/pipelines/github/OpenLineage/OpenLineage/9441/workflows/4037b224-d6f6-4213-afd5-5d7884007e53/jobs/179711
*Thread Reply:* I'm going to push a change to your branch
*Thread Reply:* To see if we can skip the copy
*Thread Reply:* even before, I reran it with SSH and it managed to copy the dependencies after all
*Thread Reply:* it took a lot of time tho
*Thread Reply:* Could you tell which dependencies it was copying?
*Thread Reply:* Like the fixture dependency
*Thread Reply:* or the container dependencies?
*Thread Reply:* not really, I just looked at df numbers
*Thread Reply:*
```
circleci@ip-10-0-113-50:~$ df -H
Filesystem       Size  Used Avail Use% Mounted on
/dev/root        156G   15G  141G  10% /
devtmpfs         8.3G     0  8.3G   0% /dev
tmpfs            8.3G     0  8.3G   0% /dev/shm
tmpfs            1.7G  857k  1.7G   1% /run
tmpfs            5.3M     0  5.3M   0% /run/lock
tmpfs            8.3G     0  8.3G   0% /sys/fs/cgroup
/dev/nvme0n1p15  110M  6.4M  104M   6% /boot/efi
circleci@ip-10-0-113-50:~$ df -H
Filesystem       Size  Used Avail Use% Mounted on
/dev/root        156G   22G  135G  14% /
devtmpfs         8.3G     0  8.3G   0% /dev
tmpfs            8.3G     0  8.3G   0% /dev/shm
tmpfs            1.7G  1.3M  1.7G   1% /run
tmpfs            5.3M     0  5.3M   0% /run/lock
tmpfs            8.3G     0  8.3G   0% /sys/fs/cgroup
/dev/nvme0n1p15  110M  6.4M  104M   6% /boot/efi
```
*Thread Reply:* I wonder if the "copyIntegrationTestFixtures" was a red herring
*Thread Reply:* And it's actually the "copyDependencies" step
*Thread Reply:* Because the fixtures JAR is tiny
*Thread Reply:* It should take seconds, at most.
*Thread Reply:* your PR failed on your favorite step, spotless 😂
*Thread Reply:* One of these days, I am going to make a pre-commit or something
*Thread Reply:* we have pre-commit config but it's focused on Python parts https://github.com/OpenLineage/OpenLineage/blob/31f8ce588526e9c7c4bc7d849699cb7ce2969c8f/.pre-commit-config.yaml#L1
*Thread Reply:* spotless is such a low hanging fruit tho...
*Thread Reply:* I did it for one of my local repos
*Thread Reply:* But I didn't get it quite right
*Thread Reply:* Perhaps I should run "spotlessCheck" on commit
*Thread Reply:* At least that way, I know my commit will not be committed if the check fails
*Thread Reply:* https://github.com/jguttman94/pre-commit-gradle
*Thread Reply:* 😞 https://github.com/pre-commit/pre-commit/issues/1110#issuecomment-518939116
*Thread Reply:* Or I should just use intellij to commit with the reformat code option
*Thread Reply:* This is still getting stuck
*Thread Reply:* I wonder if it is the download of the archives
*Thread Reply:* Those archives are "big"
*Thread Reply:* looks like `:app:copySparkBinariesSpark350Scala212` - that's right after it
*Thread Reply:* 300mb should not take that much anyway
*Thread Reply:* It's the downloading from the Apache archive
*Thread Reply:* https://archive.apache.org
*Thread Reply:* It can be really slow at times
*Thread Reply:* > The main archive of all public software releases of the Apache Software Foundation. This is simply a copy of the main distribution directory with the only difference that nothing will be ever removed over here. If you are looking for current software releases, please visit one of our numerous mirrors. Do note that heavy use of this service will result in immediate throttling of your download speeds to either 12 or 6 mbps for the remainder of the day, depending on severity. Continuous abuse (to the tune of more than 40 GB downloaded per week) will cause an automatic ban, so please tune your services to this fact.
🤔
*Thread Reply:* Another reason why we need a container registry
*Thread Reply:* then we'll get rate limited by it 🙂
*Thread Reply:* The problem is the mirrors don't contain all of the versions of Spark
*Thread Reply:* I think they go back to 3.3.4
*Thread Reply:* I think we could start using quay.io for a task:
> quay.io does not restrict anonymous pulls against its repositories (either public or private) and only rate limits in the most severe circumstances to maintain service levels (e.g. tens of requests per second from the same IP address).
*Thread Reply:* Aye, but who do we speak to in order to provision that?
*Thread Reply:* Maybe, just maybe, we can use circle ci's cache mechanism >.>
*Thread Reply:* I wonder, @Maciej Obuchowski - the GCP project that exists, can we not use the container registry in that one?
*Thread Reply:* quay.io is free, GCR is paid and not that cheap https://cloud.google.com/artifact-registry/pricing
*Thread Reply:* I see this: https://quay.io/plans/
> Public repositories are always free.
🙂
*Thread Reply:* I'm setting up an `OpenLineage` organization that we could push the images to
*Thread Reply:* Though, I don't know an "organization email" to use
*Thread Reply:* well, I was faster, I got the `openlineage` name 🙂
*Thread Reply:* Added one of mine, it's possible to change it later
*Thread Reply:* it should be possible now to log in using
```
docker login -u="${QUAY_ACCOUNT_ID}" -p="${QUAY_ACCOUNT_TOKEN}" quay.io
```
from a CI task which is marked `integration-tests`
*Thread Reply:* OK. But I'll need to build those images first, and push them.
*Thread Reply:* and push to https://quay.io/repository/openlineage/spark
*Thread Reply:* as in, the user `QUAY_ACCOUNT_ID` has permission to write there
*Thread Reply:* Can you add me to the org, so I can push the images?
*Thread Reply:* (I still have the binaries downloaded on my local)
*Thread Reply:* using email damien.hawes@booking.com?
*Thread Reply:* check the mail 🙂
*Thread Reply:* I can see spark-3.2.4-scala-2.13
*Thread Reply:* Next one is almost there
*Thread Reply:* These are some chunky images.
*Thread Reply:* Earlier I was thinking of pushing it on CI: checking if the Spark tag exists, then creating the image and pushing it if it does not
*Thread Reply:* but we need to do this only once
*Thread Reply:* and if we want to add support for another Spark/Scala version, we still need to do some work manually
*Thread Reply:* so I guess this does not matter?
*Thread Reply:* The process is fairly trivial.
*Thread Reply:* but it would still be good to have documentation on how to create the image, so you're not bothered by questions next time 🙂
*Thread Reply:* And we can make an `openlineage-spark-docker` directory
*Thread Reply:* Place the gradle in there
*Thread Reply:* It should be a project that changes very rarely
*Thread Reply:* Good find on the quay.io btw
*Thread Reply:* OK. All images have been pushed.
*Thread Reply:* yep, I can see all of them
*Thread Reply:* https://github.com/OpenLineage/OpenLineage/pull/2455 should be enought to check if it works?
*Thread Reply:* > It should be a project that changes very rarely
we should be building those images on minor Spark releases too
*Thread Reply:* Yes - that's what the spline folks did
*Thread Reply:* but they never supported 2.13
*Thread Reply:*
```
WARN tc.quay.io/openlineage/spark:spark-3.3.4-scala-2.12 - The architecture 'arm64' for image 'quay.io/openlineage/spark:spark-3.3.4-scala-2.12' (ID sha256:dbdc0c8a3e1b182004c3c850c2ecb767b76cc14e55e3e994a34356630e689e86) does not match the Docker server architecture 'amd64'. This will cause the container to execute much more slowly due to emulation and may lead to timeout failures.
```
*Thread Reply:* you have only arm containers locally?
*Thread Reply:* I'm pushing amd64 containers at the moment
*Thread Reply:* yeah it adds another dimension to the problem
*Thread Reply:* They'll probably take like 15 minutes or so
*Thread Reply:* it would be best if this was multi-arch build I think
*Thread Reply:* anyway it's not a job for now, rerunning the tests and I'm finishing for today
*Thread Reply:* OK - I have extracted the logic for building the images. It now also performs a multi-platform build, targeting linux/amd64 and linux/arm64. This should be enough for the CI/CD pipeline, folks developing with Linux and folks developing with Apple ARM chips.
The PR is here: https://github.com/OpenLineage/OpenLineage/pull/2456
The images with multiple manifests are here:
https://quay.io/repository/openlineage/spark?tab=tags
*Thread Reply:* To explain the situation for people not following the issue:
We had a problem with CI where downloading a 300MB archive from archive.apache.org took over 10 minutes, probably because we were rate limited. That failed our integration tests and blocked the release.
We used those archives to create docker images that were used for integration tests - compiled Spark of a particular version with a particular version of Scala.
The solution to that problem was manually prebuilding the images and pushing them to a free quay.io repository. This is not a problem, since bumping the version of Spark that we test on also requires manual action, and because @Damien Hawes provided a Gradle task to complete the work.
I've created an `openlineage` organization on quay.io where we can push the images - and any other images we could want, for example jupyter already configured with the Spark integration to allow people easier experimentation with OpenLineage.
If no one has any philosophical problems with that solution, I would like to see a few committer volunteers added as admins to the quay.io organization - to increase the bus factor. @Julien Le Dem @Paweł Leszczyński @Damien Hawes - do you want to be added there?
*Thread Reply:* @Maciej Obuchowski - sure.
*Thread Reply:* yes! Thank you Maciej. As long as there's a clear doc and an easy one-liner to update those, this sounds good to me. I think we need to pay extra attention to limiting write access as this is a potential injection point to modify what the build does invisibly. (you can push a different image and affect the build without modifying the repo). Is there already a signature verification on download from quay (to avoid unauthorized modification of those images)?
*Thread Reply:* > Could we add image building and storing in quay.io as part of our CI when the needed image is not present there?
We discussed this and decided it's already required to do manual work to support an additional Spark version, so automating this won't give us much
> I would love to have some info in the docs what has to be done to support Spark 3.6 once it gets released. Especially, how can one publish a 3.6 image to quay.io?
There is a readme here: https://github.com/OpenLineage/OpenLineage/pull/2456/files#diff-44ca475a04d6a92886f82dd27b47d30c8e57f518aa3dbc467feef43ec1c57638
*Thread Reply:* thanks Maciej
*Thread Reply:* > I think we need to pay extra attention on limiting write access as this is a potential injection point to modify what the build does invisibly. (you can push a different image and affect the build without modifying the repo).
@Julien Le Dem Yes, only authorized users (committers) can upload images. CI won’t write images, just read them; they would be pushed by committers before execution
> Is there already a signature verification on download from quay (to avoid unauthorized modification of those images)?
Docker verifies the SHA of downloaded images. Do you mean some additional mechanism to avoid potential problems with a compromised committer?
*Thread Reply:* The sha could be saved in the repo and compared so that it can not be changed independently by someone who would have gained access to the credentials.
*Thread Reply:* @Julien Le Dem that would make a lot of sense if the same commit that changes the images could not change the SHA 🙂
*Thread Reply:* unless we've made those SHAs part of something external, for example CircleCI config
*Thread Reply:* but TBH I think it's low risk, CircleCI would limit us fast if somebody would for example put crypto miner there
*Thread Reply:* to me the risk is more to introduce vulnerabilities/backdoors in the OpenLineage released artifact through pushing a cached image that modifies the result of the build.
*Thread Reply:* The idea of saving the image signature in the repo is that you can not use a new image in the build without creating a new commit and traceability.
CFP closes on April 30 https://events.linuxfoundation.org/open-source-summit-europe/
*Thread Reply:* Only a few days after Airflow Summit September 10-12, 2024
🤔
New communication channel: https://medium.com/@openlineageproject
*Thread Reply:* More to come...
I might be the only one running into this issue (for now). I plan on opening a PR for a fix this weekend, but if someone wants to pick it up you're more than welcome to: https://github.com/OpenLineage/OpenLineage/issues/2458
This is a GCP-specific question, but does anyone know the answer? https://openlineage.slack.com/archives/C01CK9T7HKR/p1708010167626709?thread_ts=1707920807.530409&cid=C01CK9T7HKR
FYI: The release of 1.9.0 did not go through
Issue: https://github.com/OpenLineage/OpenLineage/issues/2467 PR: https://github.com/OpenLineage/OpenLineage/pull/2468
*Thread Reply:* The biggest problem with a release, as always, is that you can't test it in any other way than running it 😞
It also seems like, despite the fact that the spark step completed, there was a silent failure or something. I don't see the artefacts on central.
*Thread Reply:* @Michael Robinson needs to manually promote them
*Thread Reply:* @Damien Hawes looking into this today
*Thread Reply:* @Michael Robinson you can rerun release now
*Thread Reply:* @Damien Hawes the release is out. comms to follow shortly
Feedback and input requested for this month's newsletter. I've added sections for the Flink and Spark integrations. Please lmk what you think about the "highlights" I've chosen for these and for Airflow if you have a moment between now and EOD Wednesday. Thanks. https://docs.google.com/document/d/15caPR4q7dOPs6co2x0q5hYSX65ZhHHdhKrFP7dVSRPI/edit?usp=sharing
*Thread Reply:* Would it be okay to highlight 2 new committers? 🙂
*Thread Reply:* 👆why I ask for input! thank you @Jakub Dardziński
Snowflake taking a play from the Databricks playbook? https://www.snowflake.com/en/data-cloud/horizon/
gotta skip today meeting. I hope to see you all next week!
The meetup I mentioned about OpenLineage/OpenTelemetry: https://x.com/J_/status/1565162740246671360 I speak in English but the other two speakers speak in Hebrew
*Thread Reply:* the slides from my part: https://docs.google.com/presentation/d/1BLM2ocs2S64NZLzNaZz5rkrS9lHRvtr9jUIetHdiMbA/edit#slide=id.g11e446d5059_0_1055
*Thread Reply:* thanks for sharing that, that otel to ol comparison is going to be very useful for me today :)
*Thread Reply:* LGTM 🙂
Hey, I created a new Airflow AIP. It proposes instrumenting Airflow Hooks and Object Storage to collect dataset updates automatically, to allow gathering lineage from PythonOperator and custom operators. Feel free to comment on Confluence https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-62+Getting+Lineage+from+Hook+Instrumentation or on the Airflow mailing list: https://lists.apache.org/thread/5chxcp0zjcx66d3vs4qlrm8kl6l4s3m2
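the core idea, as a rough sketch - all names below are hypothetical illustrations, not the actual Airflow API; the real shape is whatever the AIP discussion settles on:
```
# Hypothetical sketch of the AIP's idea, not real Airflow code: hooks report
# the datasets they touch, and a per-process collector aggregates them so a
# lineage backend can read them after task execution.
class HookLineageCollector:
    def __init__(self):
        self.inputs, self.outputs = [], []

    def add_input_dataset(self, uri):
        self.inputs.append(uri)

    def add_output_dataset(self, uri):
        self.outputs.append(uri)


_collector = HookLineageCollector()


def get_hook_lineage_collector():
    # one collector per worker process; instrumented hooks call this
    # from inside their execute path
    return _collector
```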
Hey, does anyone want to add anything here (PR that adds AWS MSK IAM transport)? It looks like it's ready to be merged.
did we miss a step in publishing 1.9.1? going here gives me the 1.8 release: https://search.maven.org/remote_content?g=io.openlineage&a=openlineage-spark&v=LATEST
*Thread Reply:* oh, this might be related to having 2 scala versions now, because I can see the 1.9.1 artifacts
*Thread Reply:* we may need to fix the docs then https://openlineage.io/docs/integrations/spark/quickstart/quickstart_databricks
*Thread Reply:* another place 🙂
*Thread Reply:* https://github.com/OpenLineage/docs/pull/299
Hi, here's a tentative agenda for next week's TSC (on Wednesday at 9:30 PT):
*Thread Reply:* I thought @Paweł Leszczyński wanted to present?
*Thread Reply:* What was the topic? Protobuf or built-in lineage maybe? Or the many docs improvements lately?
*Thread Reply:* I think so? https://github.com/OpenLineage/OpenLineage/pull/2272
*Thread Reply:* Imagine there are lots of folks who would be interested in a presentation on that
*Thread Reply:* There are two things worth presenting: circuit breaker +/or built-in lineage (once it gets merged).
*Thread Reply:* updating the agenda
is there a reason why facet objects have a `_schemaURL` property but `BaseEvent` has `schemaURL`?
*Thread Reply:* yeah, we use `_` to avoid naming conflicts in a facet
*Thread Reply:* same goes for producer
*Thread Reply:* Facets have user defined fields. So all base fields are prefixed
*Thread Reply:* it should be made more clear… recently ran into the issue when validating OL events
*Thread Reply:* it might be another missing point but we set `_producer` in `BaseFacet`:
```
def __attrs_post_init__(self) -> None:
    self._producer = PRODUCER
```
but we don’t do that for `producer` in `BaseEvent`
*Thread Reply:* is this supposed to be like that?
*Thread Reply:* I’m kinda lost 🙂
*Thread Reply:* We should set producer in `BaseEvent` as well
*Thread Reply:* The idea is the base event might be produced by the spark integration but the facet might be produced by iceberg library
*Thread Reply:* > The idea is the base event might be produced by the spark integration but the facet might be produced by iceberg library
right, it doesn’t require adding `_`, it just helps in making the difference
and also this reason too:
> Facets have user defined fields. So all base fields are prefixed
> Base events do not
*Thread Reply:* Since users can create custom facets with whatever fields, we just tell them that the `_` prefix is reserved.
*Thread Reply:* So the underscore prefix is a mechanism specific to facets
*Thread Reply:* last question:
we don’t want to block users from setting their own `_producer` field? it seems the only way now is to use the `openlineage.client.facet.set_producer` method to override the default, you can’t just do `RunEvent(…, _producer='my_own')`
*Thread Reply:* The idea is the producer identifies the code that generates the metadata. So you set it once and all the facets you generate have the same
*Thread Reply:* mhm, probably you don’t need to use several producers (at least) per Python module
*Thread Reply:* In airflow each provider should have its own for the facets they produce
*Thread Reply:* just searched for `set_producer` in current docs - no results 😨
*Thread Reply:* a number of things will get to the right track after I’m done with generating code 🙂
*Thread Reply:* Thanks for looking into that. If you can fix the doc by adding a paragraph about that, that would be helpful
*Thread Reply:* I can create an issue at least 😂
*Thread Reply:* there you go: https://github.com/OpenLineage/docs/issues/300 if I missed something please comment
I feel like our getting started with openlineage page is mostly a getting started with Marquez page. but I'm also not sure what should be there otherwise.
*Thread Reply:* https://openlineage.io/docs/guides/spark ?
*Thread Reply:* Unfortunately it's probably not that "quick" given the setup required..
*Thread Reply:* Maybe better? https://openlineage.io/docs/integrations/spark/quickstart/quickstart_local
*Thread Reply:* yeah, that's where I was struggling as well. should our quickstart be platform specific? that also feels strange.
Quick question, for the `spark.openlineage.facets.disabled` property, why do we need to include `[;]` in the value? Why can't we use `,` to act as the delimiter? Why do we need `[` and `]` to enclose the string?
*Thread Reply:* There was some concrete reason AFAIK right @Paweł Leszczyński?
*Thread Reply:* We have logic that converts Spark conf entries to OpenLineageYaml without a need to understand their content. I think `[]` was added for this reason: to know that the Spark conf entry has to be translated into an array.
Initially disabled facets were just separated by `;`. Why not a comma? I don't remember if there was any problem with this.
https://github.com/OpenLineage/OpenLineage/pull/1271/files -> this PR introduced it
https://github.com/OpenLineage/OpenLineage/blob/1.9.1/integration/spark/app/src/main/java/io/openlineage/spark/agent/ArgumentParser.java#L152 -> this code checks if a spark conf value is of array type
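for anyone landing here later, a sketch of what the array syntax looks like in practice, e.g. via PySpark - the disabled facet names are just examples, and the artifact coordinates assume a pre-1.10 release:
```
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("ol-example")
    # assumption: pre-1.10 coordinates without the Scala suffix
    .config("spark.jars.packages", "io.openlineage:openlineage-spark:1.9.1")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.transport.type", "console")
    # [a;b] tells the ArgumentParser the entry is an array; ; separates elements
    .config("spark.openlineage.facets.disabled", "[spark_unknown;spark.logicalPlan]")
    .getOrCreate()
)
```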
Hi team, do we have any proposal or previous discussion of Trino OpenLineage integration?
*Thread Reply:* There is an old third-party integration: https://github.com/takezoe/trino-openlineage
It has the right idea to use EventListener, but I can't vouch that it works
*Thread Reply:* Thanks. We are investigating the integration in our org. It will be a good starting point 🙂
*Thread Reply:* I think the ideal solution would be to use EventListener. So far we only have very basic integration in Airflow's TrinoOperator
*Thread Reply:* The only thing I haven't really checked is what the real possibilities are for EventListener in terms of catalog details discovery, e.g. what's the database connection for the catalog.
*Thread Reply:* Thanks for calling this out. We will evaluate and post some observations in the thread.
*Thread Reply:* Thanks Peter. Hey Maciej/Jakub, could you please share the process to follow for contributing a Trino OpenLineage integration? (Design doc and issue?)
There was an issue for the Trino integration, but it was closed recently. https://github.com/OpenLineage/OpenLineage/issues/164
*Thread Reply:* It would be great to see a design doc and maybe some POC if possible. I've reopened the issue for you.
If you get agreement around the design, I don't think there are more formal steps needed, but maybe @Julien Le Dem has another idea
*Thread Reply:* Trino has their plugins directory btw: https://github.com/trinodb/trino/tree/master/plugin including event listeners like: https://github.com/trinodb/trino/tree/master/plugin/trino-mysql-event-listener
*Thread Reply:* Thanks Maciej and Jakub. Yes, the integration will be done with Trino's event listener framework, which has details around the query, source and destination datasets, etc.
> It would be great to see design doc and maybe some POC if possible. I've reopened the issue for you.
Thanks for re-opening the issue. We will add the design doc and POC to the issue.
*Thread Reply:* I agree with @Maciej Obuchowski, a quick design doc followed by a POC would be great. The integration could either live in OpenLineage or Trino but that can be discussed after the POC.
*Thread Reply:* (obviously, adding it to the trino repo would require approval from the trino community)
*Thread Reply:* Gentlemen, we are also actively looking into this topic with the same repo from @takezoe as our base. I have submitted a PR to revive this project - it does work, and the POC is there in the form of a docker-compose.yaml deployment 🙂 some obvious things are missing for now (like kafka output instead of the api) but I think it's a good starting point and it's compatible with the latest Trino and OL
*Thread Reply:* Thanks for laying the foundation for the implementation. Based on it, I feel @Alok would still participate and contribute to it. How about creating a design doc and listing all of the possible TBDs, as @Julien Le Dem suggested?
*Thread Reply:* Adding @takezoe to this thread. Thanks for your work on a Trino integration and welcome!
*Thread Reply:* throwing the CFP for the Trino conference here in case any one of the contributors want to present there https://sessionize.com/trino-fest-2024
*Thread Reply:* I'm also very happy to help with an idea for an abstract
*Thread Reply:* Hey Harel, just FYI we are already engaged with the Trino community to have a talk around the Trino OpenLineage integration and have submitted an abstract for review.
*Thread Reply:* once you release the integration, please add a reference about it to OpenLineage docs! https://github.com/OpenLineage/docs
*Thread Reply:* I think it's ready for review https://github.com/trinodb/trino/pull/21265 just with API sink integration, additional features can be added at @Alok's convenience as next PRs
Hey, there’s a discrepancy between the docs and the actual behavior (the `disabled` option)
*Thread Reply:* I believe we should not extract or emit any OpenLineage events if this option is used
*Thread Reply:* I'm for option 2, don't send any event from task
*Thread Reply:* @Jakub Dardziński do you see any use case for not extracting metadata extraction but still emitting events?
*Thread Reply:* The use case AFAIK was an old SnowflakeOperator bug; we wanted to disable the collection there, since it zombified the task. The events being emitted still gave information about the status of the task as well as non-dataset related metadata
*Thread Reply:* but I think it's less relevant now
*Thread Reply:* ^ this and you might want to have information about task execution because OL is a backend for some task-tracking system
*Thread Reply:* Hm, I believe users don't expect us to spend time processing/extracting OL events if this configuration is used. It's the documented behaviour
*Thread Reply:* the question is if we should change docs or behaviour
*Thread Reply:* I believe the latter
Hi, here’s the agenda:
*Thread Reply:* Looks like a great agenda! Left a couple of comments
*Thread Reply:* @Michael Robinson will you be able to facilitate or do you need help?
*Thread Reply:* I'm also missing from the committer list, but can't comment on slides 🙂
*Thread Reply:* Sorry about that @Kacper Muda. Gave you access just now
*Thread Reply:* We probably need to add you to lists posted elsewhere... I'll check
https://github.com/open-metadata/OpenMetadata/pull/15317 👀
*Thread Reply:* this is awesome
*Thread Reply:* it looks like they use temporary deployments to test...
*Thread Reply:* yeah the GitHub history is wild
Hi, I'm at the conference hotel and my earbuds won't pair with my new mac for some reason. Does the agenda look good? Want to send out the reminders soon. I'll add the OpenMetadata news!
*Thread Reply:* I think we can also add the Datahub PR?
*Thread Reply:* @Paweł Leszczyński prefers to present only the circuit breakers
*Thread Reply:* https://github.com/datahub-project/datahub/pull/9870/files
It's been a while since we've updated the twitter profile. Current description: "A standard api for collecting Data lineage and Metadata at runtime." What would you think of using our website's tagline: "An open framework for data lineage collection and analysis." Other ideas?
can someone grant me write access to our forked `sqlparser-rs` repo?
*Thread Reply:* I should probably add the committer group to it
*Thread Reply:* I have made the committer group maintainer on this repo
https://github.com/OpenLineage/OpenLineage/pull/2514 small but mighty 😉
Regarding the approved release, based on the additions it seems to me like we should make it a minor release (so 1.10.0). Any objections? Changes are here: https://github.com/OpenLineage/OpenLineage/compare/1.9.1...HEAD
We encountered a case of a START event exceeding 2MB in Airflow. This was traced back to an operator with unusually long arguments and attributes. Further investigation revealed that our Airflow events contain redundant data across different facets, leading to unnecessary bloating of event sizes (those long attributes and args were attached three times to a single event). I proposed removing some redundant facets and refining the operator's attribute inclusion logic within AirflowRunFacet. I am not sure how breaking this change is, but some systems might depend on the current setup. Suggesting an immediate removal might not be the best approach, and I'd like to know your thoughts. (A similar problem exists within the Airflow provider.) CC @Maciej Obuchowski @Willy Lulciuc @Jakub Dardziński
https://github.com/OpenLineage/OpenLineage/pull/2509
As mentioned during yesterday's TSC, we can't get insight into DataHub's integration from the PR description in their repo. And it's a very big PR. Does anyone have any intel? PR is here: https://github.com/datahub-project/datahub/pull/9870
Changelog PR for 1.10 is RFR: https://github.com/OpenLineage/OpenLineage/pull/2516
@Julien Le Dem @Paweł Leszczyński Release is failing in the Java client job due to (I think) the version of spotless:
```
Could not resolve com.diffplug.spotless:spotless-plugin-gradle:6.21.0.
Required by:
    project : > com.diffplug.spotless:com.diffplug.spotless.gradle.plugin:6.21.0
No matching variant of com.diffplug.spotless:spotless-plugin-gradle:6.21.0 was found. The consumer was configured to find a library for use during runtime, compatible with Java 8, packaged as a jar, and its dependencies declared externally, as well as attribute 'org.gradle.plugin.api-version' with value '8.4'
```
*Thread Reply:* @Michael Robinson https://github.com/OpenLineage/OpenLineage/pull/2517
fix to broken main: https://github.com/OpenLineage/OpenLineage/pull/2518
*Thread Reply:* Thanks, just tried again
*Thread Reply:* ? it needs approval and merge 😛
*Thread Reply:* Oh oops disregard
There's an issue with the Flink job on CI:
```
* What went wrong:
Could not determine the dependencies of task ':shadowJar'.
> Could not resolve all dependencies for configuration ':runtimeClasspath'.
   > Could not find io.**********************:**********************_sql_java:1.10.1.
     Searched in the following locations:
       - https://repo.maven.apache.org/maven2/io/**********************/**********************-sql-java/1.10.1/**********************-sql-java-1.10.1.pom
       - https://packages.confluent.io/maven/io/**********************/**********************-sql-java/1.10.1/**********************-sql-java-1.10.1.pom
       - file:/home/circleci/.m2/repository/io/**********************/**********************-sql-java/1.10.1/**********************-sql-java-1.10.1.pom
     Required by:
         project : > project :shared
         project : > project :flink115
         project : > project :flink117
         project : > project :flink118
```
*Thread Reply:* https://github.com/OpenLineage/OpenLineage/pull/2521
*Thread Reply:* @Jakub Dardziński still awake? 🙂
*Thread Reply:* it’s just approval bot
*Thread Reply:* created issue on how to avoid those in the future https://github.com/OpenLineage/OpenLineage/issues/2522
*Thread Reply:* https://app.circleci.com/jobs/github/OpenLineage/OpenLineage/188526 I lack emojis on this server to fully express my emotions
*Thread Reply:* https://openlineage.slack.com/archives/C065PQ4TL8K/p1710454645059659 you might have missed that
*Thread Reply:* merge -> rebase -> problem gone
*Thread Reply:* PR to update the changelog is RFR @Jakub Dardziński @Maciej Obuchowski: https://github.com/OpenLineage/OpenLineage/pull/2526
https://github.com/OpenLineage/OpenLineage/pull/2520 It’s a long-awaited PR - feel free to comment!
OpenLineage is trending upward on OSSRank. Please vote!
https://github.com/OpenLineage/OpenLineage/blob/main/spec/facets/ParentRunFacet.json#L20
here the format is `uuid`
however, if you follow the logic for the parent id in the current dbt integration, you might discover that the parent run facet gets assigned the value of the DAG's run_id (which is not a uuid)
@Julien Le Dem, what has higher priority? I think lots of people are using the `dbt-ol` wrapper with the current `lineage_parent_id` macro
*Thread Reply:* It is a uuid because it should be the id of an OL run
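one illustrative way to reconcile a non-uuid run_id with the spec - not what the integration does today, just a sketch of a deterministic mapping:
```
import uuid

# uuid5 is deterministic: the same run_id always yields the same UUID,
# so retries/re-emits stay consistent. The namespace value is arbitrary.
NAMESPACE = uuid.UUID("00000000-0000-0000-0000-000000000000")


def run_id_to_uuid(run_id: str) -> str:
    return str(uuid.uuid5(NAMESPACE, run_id))


print(run_id_to_uuid("scheduled__2024-03-01T00:00:00+00:00"))
```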
where can I find who has write access to OL repo?
*Thread Reply:* Settings > Collaborators and teams
*Thread Reply:* thanks Michael, seems like I don’t have enough permissions to see that
Sorry, I have a dr appointment today and won’t join the meeting
*Thread Reply:* I gotta skip too. Maciej and Pawel are at the Kafka Summit
*Thread Reply:* I hope you’re fine!
Should we cancel the sync today?
looking at XTable today, any thoughts on how we can collaborate with them?
*Thread Reply:* @Julien Le Dem @Willy Lulciuc this reminds me of some ideas we had a few years ago.. :)
*Thread Reply:* hmm.. ok. maybe not that relevant for us, at first I thought this was an abstraction for read/write on top of Iceberg/Hudi/Delta.. but I think this is more of a data sync appliance. would still be relevant for linking together synced datasets (but I don't think it's that important now)
*Thread Reply:* From the introduction https://www.confluent.io/blog/introducing-tableflow/, it looks like they are using Flink for both data ingestion and compaction. It means we should at least consider supporting a Hudi source and sink for Flink lineage 🙂
Eyes on this PR to add OpenMetadata to the Ecosystem page would be appreciated: https://github.com/OpenLineage/docs/pull/303. TIA! @Mariusz Górski
I really want to improve this page in the docs, anyone want to work with me on that?
*Thread Reply:* perhaps also make this part of the PR process, so when we add support for something, we remember to update the docs
*Thread Reply:* I free up next week and would love to chat… obviously, time permitting but the page needs some love ❤️
*Thread Reply:* I can verify the information once you have some PR 🙂
RFR: a PR to add DataHub to the Ecosystem page https://github.com/OpenLineage/docs/pull/304
*Thread Reply:* The description comes from the very brief README in DataHub's GH repo and a glance at the code. No other documentation or resources appear to be available.
*Thread Reply:* Do we have any update on this?
*Thread Reply:* sorry, I will add on the following week
Dagster is launching column-lineage support for dbt using the sqlglot parser https://github.com/dagster-io/dagster/pull/20407
*Thread Reply:* I kinda like their approach of using post-hooks to enable column-level lineage: a custom macro collects information about columns, logs it, and they parse the log after the execution
*Thread Reply:* it doesn’t force the `dbt docs generate` step that some might not want to use
*Thread Reply:* but at the same time it reuses the dbt adapter to make additional calls to retrieve missing metadata
@Paweł Leszczyński interesting project I came across over the weekend: https://github.com/HamaWhiteGG/flink-sql-lineage
*Thread Reply:* Wow, this is something we would love to have (Flink SQL support). It's great to know that people around the globe are working on the same thing and heading in the same direction. Great find @Willy Lulciuc. Thanks for sharing!
*Thread Reply:* At Kafka Summit I talked with Timo Walther from the Flink SQL team and he proposed an alternative approach.
Flink SQL has a stable (across releases) `CompiledPlan` JSON text representation that could be parsed, and it has all the necessary info - as this is used for serializing the actual execution plan both ways.
*Thread Reply:* As Flink SQL converts to transformations before execution, technically speaking our existing solution is already able to create lineage info for Flink SQL apps (not including column lineage and table schemas, which can be inferred within the Flink table environment). I will create a Flink SQL job for e2e testing purposes.
*Thread Reply:* I am also working on the Flink side for table lineage. Hopefully, new lineage features can be released in Flink 1.20.
Sessions for this year's Data+AI Summit have been published. A search didn't turn up anything related to lineage, but did you know Julien and Willy's talk at last year's summit has received 4k+ views? 👀
*Thread Reply:* seems like our talk was not accepted, but I can see 9 sessions on unity catalog 😕
finally merged 🙂
pawel-big-lebowski commented on Nov 21, 2023
whoa
I’ll miss the sync today (on the way to data council)
*Thread Reply:* have fun at the conference!
OK @Maciej Obuchowski - 1 job has many stages; 1 stage has many tasks. Transitively, this means that 1 job has many tasks.
*Thread Reply:* batch or streaming one? 🙂
*Thread Reply:* Doesn't matter. It's the same concept.
Also @Paweł Leszczyński, seems Spark metrics has this:
```
local-1711474020860.driver.LiveListenerBus.listenerProcessingTime.io.openlineage.spark.agent.OpenLineageSparkListener
         count = 12
     mean rate = 1.19 calls/second
 1-minute rate = 1.03 calls/second
 5-minute rate = 1.01 calls/second
15-minute rate = 1.00 calls/second
           min = 0.00 milliseconds
           max = 1985.48 milliseconds
          mean = 226.81 milliseconds
        stddev = 549.12 milliseconds
        median = 4.93 milliseconds
          75% <= 53.64 milliseconds
          95% <= 1985.48 milliseconds
          98% <= 1985.48 milliseconds
          99% <= 1985.48 milliseconds
        99.9% <= 1985.48 milliseconds
```
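side note: those numbers come from Spark's built-in Dropwizard metrics; a sketch of surfacing them without a metrics.properties file, via inline conf - the sink keys are standard Spark metrics config, the session setup itself is an assumption:
```
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("metrics-example")
    # route driver metrics (incl. listener processing times) to stdout
    .config("spark.metrics.conf.driver.sink.console.class",
            "org.apache.spark.metrics.sink.ConsoleSink")
    .config("spark.metrics.conf.driver.sink.console.period", "10")
    .config("spark.metrics.conf.driver.sink.console.unit", "seconds")
    .getOrCreate()
)
```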
Do you think Bipan's team could potentially benefit significantly from upgrading to the latest version of openlineage-spark? https://openlineage.slack.com/archives/C01CK9T7HKR/p1711483070147019
*Thread Reply:* @Paweł Leszczyński wdyt?
*Thread Reply:* I think the issue here is that marquez is not able to properly visualize parent run events that Maciej has added recently for a Spark application
*Thread Reply:* So if they downgraded would they have a graph closer to what they want?
*Thread Reply:* I don't see parent run events there?
I'm exploring ways to improve the demo gif in the Marquez README. An improved and up-to-date demo gif could also be used elsewhere -- in the Marquez landing pages, for example, and the OL docs. Along with other improvements to the landing pages, I created a new gif that's up to date and higher-resolution, but it's large (~20 MB).
• We could put it on YouTube and link to it, but that would downgrade the user experience in other ways.
• We could host it somewhere else, but that would mean adding another tool to the stack and, depending on file size limits, could cost money. (I can't imagine it would cost much, but I haven't really looked into this option yet. Regardless of cost, it seems to have the same drawbacks as YT from a UX perspective.)
• We could have GitHub host it in another repo (for free) in the Marquez or OL orgs.
  ◦ It could go in the OL docs because it's likely we'll want to use it in the docs anyway, but even if we never serve it, wouldn't this create issues for local development at a minimum? I opened a PR to do this, which a PR with other improvements is waiting on, but I'm not sure about this approach.
  ◦ It could go in the unused Marquez website repo, but there's a good chance we'll forget it's there and remove or archive the repo without moving it first.
  ◦ In another repo, or even a new one for stuff like this?
Anyone have an opinion or know of a better option?
*Thread Reply:* maybe make it an HTML5 video?
*Thread Reply:* https://wp-rocket.me/blog/replacing-animated-gifs-with-html5-video-for-faster-page-speed/
@Julien Le Dem @Harel Shein how did Data Council panel and talk go?
*Thread Reply:* Was just composing the message below :)
Some great discussions here at data council, the panel was really great and we can definitely feel energy around OpenLineage continuing to build up! 🚀 Thanks @Julien Le Dem for organizing and shoutout to @Ernie Ostic @Sheeri Cabral (Collibra) @Eric Veleker for taking the time and coming down here and keeping pushing more and building the community! ❤️
*Thread Reply:* @Harel Shein did anyone take pictures?
*Thread Reply:* there should be plenty of pictures from the conference organizers, we'll ask for some
*Thread Reply:* Did a search and didn't see anything
*Thread Reply:* Speaker dinner the night before: https://www.linkedin.com/posts/datacouncil-aidatacouncil-ugcPost-7178852429705224193-De46?utmsource=share&utmmedium=memberios
*Thread Reply:* haha. Julien and Ernie look great while I'm explaining how to land an airplane 🛬
*Thread Reply:* The photo gallery is there
*Thread Reply:* awesome! just in time for the newsletter 🙂
*Thread Reply:* Thank you for thinking of us. Onwards and upwards.
I just found that the naming conventions for hive/iceberg/hudi are not listed in the doc https://openlineage.io/docs/spec/naming/. Shall we further standardize them? Any suggestions?
*Thread Reply:* Yes. This also came up in a conversation with one of the maintainers of dbt-core, we can also pick up on a proposal to extend the naming conventions markdown to something a bit more scalable.
*Thread Reply:* What do you think about this proposal? https://github.com/OpenLineage/OpenLineage/pull/1702
*Thread Reply:* Thanks for sharing the info. Will take a deeper look later today.
*Thread Reply:* I think this is a similar topic to resource naming in ODD; it might be worth taking a look for inspiration: https://github.com/opendatadiscovery/oddrn-generator
*Thread Reply:* the thing is we need to have a language-agnostic way of defining those naming conventions and be able to generate code for them, similar to the facets spec
*Thread Reply:* it could also be an idea to embed a micro REST API in each client, so the naming conventions would be managed there and each client (python/java) could run it as a subprocess 🤔
*Thread Reply:* we can also just write it in Rust, @Maciej Obuchowski 😁
*Thread Reply:* no real changes/additions, but starting to organize the doc for now: https://github.com/OpenLineage/OpenLineage/pull/2554
@Maciej Obuchowski we also heard some good things about the sqlglot parser. have you looked at it recently?
*Thread Reply:* I love the fact that our parser is in a type-safe language :)
*Thread Reply:* does it matter after all when it comes to parsing SQL? it might be worth running some comparisons, but it may turn out that sqlglot misses most of the Snowflake dialect that we currently support
*Thread Reply:* We'd miss out on Java-side parsing as well
*Thread Reply:* very importantly this ^
OpenLineage 1.11.0 release vote is now open: https://openlineage.slack.com/archives/C01CK9T7HKR/p1711980285409389
Sorry, I’ll be late to the sync
forgot to mention, but we have the TSC meeting coming up next week. we should start sourcing topics
*Thread Reply:* • 1.10 and 1.11 releases
• Data Council, Kafka Summit, & Boston meetup shout outs and quick recaps
• Datadog poc update or demo?
*Thread Reply:* Discussion item about Trino integration next steps?
*Thread Reply:* Accenture+Confluent roundtable reminder for sure
*Thread Reply:* job to job dependencies discussion item? https://openlineage.slack.com/archives/C065PQ4TL8K/p1712153842519719
*Thread Reply:* I think it's too early for a Datadog update tbh, but I like the job to job discussion. We can also bring up the naming library discussion that we talked about yesterday
*Thread Reply:* Shared a slide deck with you today. (If anyone else would like access, please lmk!)
*Thread Reply:* Friendly reminder: this month's tsc is tomorrow
one more thing, if we want we could also apply for a free Datadog account for OpenLineage and Marquez: https://www.datadoghq.com/partner/open-source/
*Thread Reply:* would be nice for tests
is there any notion of process dependencies in openlineage? i.e. if I have two airflow tasks that depend on each other, with no dataset in between, can I express that in the openlineage spec?
*Thread Reply:* AFAIK no, it doesn't aim to reflect that cc @Julien Le Dem
*Thread Reply:* It is not in the core spec but this could be represented as a job facet. It is probably in the airflow facet right now but we could add a more generic job dependency facet
*Thread Reply:* we do represent hierarchy though - with ParentRunFacet
*Thread Reply:* if we were to add some dependency facet, what would we want to model?
*Thread Reply:* do we also want to model something like Airflow's trigger rules? https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/dags.html#trigger-rules
*Thread Reply:* I don't think this is about hierarchy though, right? If I understand @Julian LaNeve correctly, I think it's more #2
*Thread Reply:* yeah it's less about hierarchy - definitely more about #2.
assume we have a DAG that looks like this:
Task A -> Task B -> Task C
today, OL can capture the full set of dependencies if we do:
A -> (dataset 1) -> B -> (ds 2) -> C
but it's not always the case that you have datasets between everything. my question was moreso around "how can I use OL to capture the relationship between jobs if there are no datasets in between"
*Thread Reply:* I had opened an issue to track this a while ago but we did not get too far in the discussion: https://github.com/OpenLineage/OpenLineage/issues/552
*Thread Reply:* oh nice - unsurprisingly you were 2 years ahead of me 😆
*Thread Reply:* You can track the dependency both at the job level and at the run level.
At the job level you would do something along the lines of:
```
job: { facets: {
  job_dependencies: {
    predecessors: [
      { namespace: ..., name: ... }, ...
    ],
    successors: [
      { namespace: ..., name: ... }, ...
    ]
  }
}}
```
*Thread Reply:* At the run level you could track the actual task run dependencies:
```
run: { facets: {
  run_dependencies: {
    predecessor: [ "{run uuid}", ... ],
    successors: [ ... ],
  }
}}
```
*Thread Reply:* I think the current airflow run facet contains that information in an airflow specific representation: https://github.com/apache/airflow/blob/main/airflow/providers/openlineage/plugins/facets.py
*Thread Reply:* I think we should have the discussion in the ticket so that it does not get lost in the slack history
*Thread Reply:* ```
run: { facets: {
  run_dependencies: {
    predecessor: [ "{run uuid}", ... ],
    successors: [ ... ],
  }
}}
```
I like this format, but would use the full run/job identifier, as in ParentRunFacet
*Thread Reply:* For the trigger rules I wonder if this is too specific to airflow.
*Thread Reply:* But if there’s a generic way to capture this, it makes sense
Don't forget to register for this! https://events.confluent.io/roundtable-data-lineage/Accenture
This attempt at a SQLAlchemy integration was basically working, if not perfectly, the last time I played with it: https://github.com/OpenLineage/OpenLineage/pull/2088. What more do I need to do to get it to the point where it can be merged as an "experimental"/"we warned you" integration? I mean, other than make sure it's still working and clean it up? 🙂
https://docs.getdbt.com/docs/collaborate/column-level-lineage#sql-parsing
*Thread Reply:* seems like it’s only for dbt cloud
*Thread Reply:* > Column-level lineage relies on SQL parsing.
Was thinking about doing the same thing at some point
*Thread Reply:* Basically with dbt we know schemas, so we also can resolve wildcards as well
*Thread Reply:* but that requires adding capability for providing known schema into sqlparser
*Thread Reply:* that's not very hard to add afaik 🙂
*Thread Reply:* not exactly into sqlparser too
*Thread Reply:* just our parser
*Thread Reply:* yeah, our parser
*Thread Reply:* still someone has to add it :D
*Thread Reply:* some rust enthusiast probably
*Thread Reply:* but also: dbt provides schema info only if you generate catalog.json with generate docs command
*Thread Reply:* Right now we have the dbt-ol wrapper anyway, so we can run another dbt docs command on behalf of the user too
*Thread Reply:* not sure if running commands on behalf of user is good idea, but denoting in docs that running it increases accuracy of column-level lineage is probably a good idea
*Thread Reply:* once we build it
*Thread Reply:* That depends, what are the side effects of running dbt docs?
*Thread Reply:* the other option is similar to dagster's approach - run post-hook macro that prints schema to logs and read the logs with dbt-ol wrapper
*Thread Reply:* which again won't work in dbt cloud - there catalog.json seems like the only option
*Thread Reply:* > That depends, what are the side effects of running dbt docs?
refreshing someone's documentation? 🙂
*Thread Reply:* it would be configurable imho, if someone doesn’t want column level lineage at the price of an additional step, it’s their choice
*Thread Reply:* yup, agreed. I'm sure we can also run dbt docs to a temp directory that we'll delete right after
*Thread Reply:* That's an increase of 17560.5%
*Thread Reply:* https://github.com/OpenLineage/OpenLineage/releases/tag/1.11.3 that’s a lot of notes 😮
*Thread Reply:* the way spark.jars.packages io.openlineage:openlineage_spark:{version} works, every spark job downloads the jar when it runs.
*Thread Reply:* so that's a cool way to track that.
*Thread Reply:* Just learned about this tool that claims to turn downloads, etc., into data that's more usable for insights into users (as opposed to, say, spark jobs): https://about.scarf.sh/
*Thread Reply:* Hello, my jobs are not downloading the JAR, is there some specific setup needed to enable it?
Marquez committers: there's a committer vote open 👀
*Thread Reply:* We still need a few more votes if you can spare a moment to vote over there...
did anyone submit a CFP here? https://sessionize.com/open-source-summit-europe-2024/ it's a linux foundation conference too
*Thread Reply:* looks like a nice conference
*Thread Reply:* too far for me, but might be a train ride for you?
*Thread Reply:* yeah, I might submit something 🙂
*Thread Reply:* and I think there are actually direct trains to Vienna from Warsaw
Hmm @Maciej Obuchowski @Paweł Leszczyński - I see we released 1.11.3, but I don't see the artifacts in central. Are the artifacts blocked?
*Thread Reply:* after last release, it took me some 24h to see openlineage-flink artifact published
*Thread Reply:* I recall something about the artifacts had to be manually published from the staging area.
*Thread Reply:* @Maciej Obuchowski - can you check if the release is stuck in staging?
*Thread Reply:* I recall last time it failed because there wasn't a javadoc associated with it
*Thread Reply:* Nevermind @Paweł Leszczyński @Maciej Obuchowski - it seems like the search indexes haven't been updated.
*Thread Reply:* @Michael Robinson has to manually promote them but it's not instantaneous I believe
I'm seeing some really strange behavior with OL Spark, I'm going to give some data to help out, but these are still breadcrumbs unfortunately. 🧵
*Thread Reply:* the driver for this job is running for more than 5 hours, but the job actually finished after 20 minutes
*Thread Reply:* most of the cpu time in those 5 hours is spent in openlineage methods
*Thread Reply:* it's also not reproducible 😕
*Thread Reply:* DatasetIdentifier.equals?
*Thread Reply:* can you check what calls it?
*Thread Reply:* unfortunately, some of the stack frames are truncated by JVM
*Thread Reply:* maybe this has something to do with SymLink and the lombok implementation of .equals() ?
*Thread Reply:* and then some sort of circular dependency
*Thread Reply:* one possible place, looks like n^2 algorithm: https://github.com/OpenLineage/OpenLineage/blob/4ba93747e862e333267b46a57f02a09264[…]rk3/agent/lifecycle/plan/column/JdbcColumnLineageCollector.java
*Thread Reply:* but is this a JDBC job?
*Thread Reply:* ok, we don't use lang3 Pair a lot - it has to be in ColumnLevelLineageBuilder 🙂
*Thread Reply:* yes.. I'm staring at that class for a while now
*Thread Reply:* what's the rough size of the logical plan of the job?
*Thread Reply:* I'm trying to understand whether we're looking at some infinite loop
*Thread Reply:* or just something done very inefficiently
*Thread Reply:* like every input being added in this manner:
```
public void addInput(ExprId exprId, DatasetIdentifier datasetIdentifier, String attributeName) {
  inputs.computeIfAbsent(exprId, k -> new LinkedList<>());

  Pair<DatasetIdentifier, String> input = Pair.of(datasetIdentifier, attributeName);

  if (!inputs.get(exprId).contains(input)) {
    inputs.get(exprId).add(input);
  }
}
```
it's a candidate: it has to traverse the list returned from inputs for every CLL dependency field added
*Thread Reply:* it looks like we're building a size-N list in N^2 time:
```
inputs.stream()
    .filter(i -> i instanceof InputDatasetFieldWithIdentifier)
    .map(i -> (InputDatasetFieldWithIdentifier) i)
    .forEach(
        i ->
            context
                .getBuilder()
                .addInput(
                    ExprId.apply(i.exprId().exprId()),
                    new DatasetIdentifier(
                        i.datasetIdentifier().getNamespace(), i.datasetIdentifier().getName()),
                    i.field()));
```
🙂
*Thread Reply:* ah, this isn't even used now since it's for new extension-based spark collection
*Thread Reply:* @Paweł Leszczyński this is most likely a future bug ⬆️
*Thread Reply:* I think we're still doing it now anyway:
```
private static void extractInternalInputs(
    LogicalPlan node,
    ColumnLevelLineageBuilder builder,
    List<DatasetIdentifier> datasetIdentifiers) {

  datasetIdentifiers.stream()
      .forEach(
          di -> {
            ScalaConversionUtils.fromSeq(node.output()).stream()
                .filter(attr -> attr instanceof AttributeReference)
                .map(attr -> (AttributeReference) attr)
                .collect(Collectors.toList())
                .forEach(attr -> builder.addInput(attr.exprId(), di, attr.name()));
          });
}
```
*Thread Reply:* and that's linked list - must be pretty slow jumping all those pointers
*Thread Reply:* maybe it's that simple 🙂 https://github.com/OpenLineage/OpenLineage/commit/306778769ae10fa190f3fd0eff7a6482fc50f57f
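A rough sketch of the kind of change being discussed here - swapping the LinkedList for a LinkedHashSet so the duplicate check is O(1) instead of an O(n) scan. The names mirror the snippets above; this is an illustration, not the actual patch:
```
// assumption: Map<ExprId, Set<Pair<DatasetIdentifier, String>>> inputs;
public void addInput(ExprId exprId, DatasetIdentifier datasetIdentifier, String attributeName) {
  // LinkedHashSet keeps insertion order like the LinkedList did, but Set.add
  // is a no-op on duplicates, so the explicit contains() scan goes away
  inputs
      .computeIfAbsent(exprId, k -> new LinkedHashSet<>())
      .add(Pair.of(datasetIdentifier, attributeName));
}
```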
*Thread Reply:* There are some more funny places in the CLL code, like iterating over the list of schema fields and calling a function with the name of each field:
```
schema.getFields().stream()
    .map(field -> Pair.of(field, getInputsUsedFor(field.getName())))
```
then immediately iterating over it a second time to get the field back from its name:
```
List<Pair<DatasetIdentifier, String>> getInputsUsedFor(String outputName) {
  Optional<OpenLineage.SchemaDatasetFacetFields> outputField =
      schema.getFields().stream()
          .filter(field -> field.getName().equalsIgnoreCase(outputName))
          .findAny();
```
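A hedged sketch of how that double scan could be avoided - keep the field the outer stream already has in hand instead of re-resolving it by name. getInputsUsedFor(field) here is a hypothetical overload, not existing code:
```
// one pass: no equalsIgnoreCase lookup of the field we already hold
schema.getFields().stream()
    .map(field -> Pair.of(field, getInputsUsedFor(field)))
```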
*Thread Reply:* I think the time spent by the driver (5 hours) just on these methods smells like an infinite loop?
*Thread Reply:* like, as inefficient as it may be, this is a lot of time
*Thread Reply:* did it finish eventually?
*Thread Reply:* yes... but.. I wonder if something killed it somewhere?
*Thread Reply:* I mean, it can be something like 10000^3 loop 🙂
*Thread Reply:* I couldn't find anything in the logs to indicate
*Thread Reply:* and it has to do those pair comparisons
*Thread Reply:* would be easier if we could see the general size of a plan of this job - if it's something really small then I'm probably wrong
*Thread Reply:* but if there are 1000s of columns... anything can happen 🙂
*Thread Reply:* yeah.. trying to find out. I don't have that facet enabled there, and I can't find the ol events in the logs (it's writing to console, and I think they got dropped)
*Thread Reply:* DevNullTransport 🙂
*Thread Reply:* I think this might be potentially really slow too https://github.com/OpenLineage/OpenLineage/blob/50afacdf731f810354be0880c5f1fd05a1[…]park/agent/lifecycle/plan/column/ColumnLevelLineageBuilder.java
*Thread Reply:* generally speaking, we have a similar problem here to the one we had with the Airflow integration
*Thread Reply:* we are not holding up the job per se, but... we are holding up the spark application
*Thread Reply:* do we have a way to be defensive about that somehow, shutdown hook from spark to our thread or something
*Thread Reply:* there's no magic
*Thread Reply:* circuit breaker with timeout does not work?
*Thread Reply:* it would, but we don't turn that on by default
*Thread Reply:* also, if we do, what should be our default values?
*Thread Reply:* what would not hurt you if you enabled it, 30 seconds?
*Thread Reply:* I guess we should aim much lower with the runtime
*Thread Reply:* yeah, and make sure we emit metrics / logs when that happens
*Thread Reply:* wait, our circuit breaker right now only supports cpu & memory
*Thread Reply:* we would need to add a timeout one, right?
*Thread Reply:* we've talked about it but it's not implemented yet https://github.com/OpenLineage/OpenLineage/blob/3dad978a3a76ea9bb709334f1526086f95[…]o/openlineage/client/circuitBreaker/ExecutorCircuitBreaker.java
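For the record, a purely illustrative sketch of what a timeout variant could look like - the class and method names are made up, not the actual ExecutorCircuitBreaker API:
```
import java.time.Duration;
import java.util.Optional;
import java.util.concurrent.*;

class TimeoutCircuitBreakerSketch {
  private final ExecutorService executor = Executors.newSingleThreadExecutor();

  // run the lineage callback under a time budget; give up and return empty
  // (ideally also emitting a metric / warn log) if it blows the budget
  <T> Optional<T> callWithTimeout(Callable<T> callback, Duration timeout) {
    Future<T> future = executor.submit(callback);
    try {
      return Optional.of(future.get(timeout.toMillis(), TimeUnit.MILLISECONDS));
    } catch (TimeoutException e) {
      future.cancel(true); // interrupt the hanging lineage work
      return Optional.empty();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return Optional.empty();
    } catch (ExecutionException e) {
      return Optional.empty();
    }
  }
}
```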
*Thread Reply:* and BTW, no abnormal CPU or memory usage?
*Thread Reply:* I mean, it's using 100% of one core 🙂
*Thread Reply:* it's similar to what aniruth experienced. there's something that, for some types of logical plans, causes recursion-like behaviour. However, I don't think it's recursion bcz it's ending at some point. If we had DebugFacet we would be able to know which logical plan nodes are involved in this.
*Thread Reply:* > If we had DebugFacet we would be able to know which logical plan nodes are involved in this.
if the event would not take 1GB 🙂
*Thread Reply:* > it's similar to what aniruth experienced. there's something that, for some types of logical plans, causes recursion-like behaviour. However, I don't think it's recursion bcz it's ending at some point. If we had DebugFacet we would be able to know which logical plan nodes are involved in this.
what about my thesis that something is just extremely slow in the column-level lineage code?
Good news. @Paweł Leszczyński - the memory leak fixes worked. Our streaming pipelines have run through the weekend without a single OOM crash.
*Thread Reply:* @Damien Hawes Would you please point me to the PR that fixes the issue?
*Thread Reply:* This was the issue: https://github.com/OpenLineage/OpenLineage/issues/2561
There were two PRs:
*Thread Reply:* @Peter Huang ^
*Thread Reply:* @Damien Hawes any other feedback for OL with streaming pipelines you have so far?
*Thread Reply:* It generates a TON of data
*Thread Reply:* There are some optimisations that could be made: the
job start -> stage submitted -> task started -> task ended -> stage complete -> job end
cycle fires more frequently.
*Thread Reply:* This has an impact on any backend using it, as the run id keeps changing. This means the parent suddenly has thousands of jobs as children.
*Thread Reply:* Our biggest pipeline generates a new event cycle every 2 minutes.
*Thread Reply:* "Too much data" is exactly what I thought 🙂 The obvious potential issue with caching is the same issue we just fixed... potential memory leaks, and cache invalidation
*Thread Reply:* > the run id keeps changing
In this case, that's a bug. We'd still need some wrapping event for whole streaming job though, probably other than application start
*Thread Reply:* on the other topic, did those problems stop? https://github.com/OpenLineage/OpenLineage/issues/2513 with https://github.com/OpenLineage/OpenLineage/pull/2535/files
*Thread Reply:* So far, we haven't seen anything.
*Thread Reply:* >> the run id keeps changing
> In this case, that's a bug. We'd still need some wrapping event for whole streaming job though, probably other than application start
That could be quite the deviation though, because in our case, the dataset that is being written to keeps changing, as it's partitioned by date and hour.
when talking about the naming scheme for datasets, would everyone here agree that we generally use {scheme}://{authority}/{unique_name}? where generally authority == namespace
*Thread Reply:* I think so, and if we don’t then we should
*Thread Reply:* ~which brings me to the question why construct dataset name as such~ nvm
*Thread Reply:* please feel free to chime in here too https://github.com/dbt-labs/dbt-core/issues/8725
*Thread Reply:* > where generally authority == namespace
{scheme}://{authority} is the namespace
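To make the convention concrete, a tiny illustrative snippet - the host and table values are made up:
```
String scheme = "postgres";
String authority = "db.example.com:5432";        // host:port
String uniqueName = "mydb.myschema.mytable";     // database.schema.table

String namespace = scheme + "://" + authority;   // the dataset namespace
String fullName = namespace + "/" + uniqueName;  // {scheme}://{authority}/{unique_name}
```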
Hey! I’m new to the world of data :) Would love to know the advantages of using open lineage over open metadata, Thanks!
*Thread Reply:* fun fact is that you can use both: https://github.com/open-metadata/OpenMetadata/pull/15317
*Thread Reply:* also, this would be better in #general 🙂
*Thread Reply:* an even funnier fact is that OL doesn't aim to compete with other tools, but rather to let them integrate with it
*Thread Reply:* Great! So you’d recommend using open metadata as the main platform for metadata collection and integrate it with OL?
*Thread Reply:* I don't have recommendation 🙂 but feel very invited to try out OpenLineage, it's a really great product!
*Thread Reply:* I would say it's not an apples to apples comparison because OpenLineage is a lineage metadata specification and OpenMetadata is a data catalog with a lineage solution. The fact that OpenMetadata and DataHub have recently merged OpenLineage integrations tells you pretty much everything you need to know about where the data world is headed in terms of lineage. 😉 With OpenLineage, you're not bound to one data catalog's set of connectors/extractors, and that's the point of an open and shared spec that's also extensible. I recommend reading the docs and exploring the GitHub repo for more information about the spec and the object model. As Maciej said, you'll probably get more responses to this kind of question in #general . In any case, welcome and have fun exploring!
Apologies in advance for having to leave today's TSC 30 minutes early! Conflict with another meeting
@Paweł Leszczyński - regarding https://github.com/OpenLineage/OpenLineage/issues/2594. I may have a solution, however, I am looking for advice on where to place this code. I need to go quite deep into Spark internals, basically, I need to drop down to the input partition level. I have this POC PR open: https://github.com/OpenLineage/OpenLineage/pull/2600
*Thread Reply:* this code is really weird
*Thread Reply:* create LogicalRDD from the dataframe, then create a dataset backed by this RDD?
*Thread Reply:* And break logical lineage in the process, yes
*Thread Reply:* originDataset is thrown away, right?
*Thread Reply:* okay, I see your solution - definitely something that needs time to spend looking into
*Thread Reply:* regarding proxy, you can also try the forceAccess version: MethodUtils.invokeMethod(Object object, boolean forceAccess, String methodName)
*Thread Reply:* No need. They are public methods
*Thread Reply:* As it's a Scala case class
*Thread Reply:* @Paweł Leszczyński - any thoughts on this?
*Thread Reply:* @Paweł Leszczyński - I guess it could fit into there, but this is more about matching on a LogicalRDD in the first place. We already have LogicalRDD input dataset builders, but they're specific to HadoopFS based things.
I'm thinking about cracking that open and making it a bit more generic, by allowing it to accept strategies for different types of LogicalRDDs, and delegating to a strategy per logical RDD type.
This is a potential rabbit hole though.
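Sketching that strategy idea in a few lines - the interface and names are hypothetical, not existing OL classes:
```
// assumption: a List<LogicalRddDatasetStrategy> strategies field exists on the builder.
// Each strategy knows how to extract input datasets from one flavour of LogicalRDD.
interface LogicalRddDatasetStrategy {
  boolean isDefinedAt(LogicalRDD logicalRdd);
  List<OpenLineage.InputDataset> extract(LogicalRDD logicalRdd);
}

// the generic builder just delegates to the first strategy that matches
List<OpenLineage.InputDataset> apply(LogicalRDD logicalRdd) {
  return strategies.stream()
      .filter(s -> s.isDefinedAt(logicalRdd))
      .findFirst()
      .map(s -> s.extract(logicalRdd))
      .orElse(Collections.emptyList());
}
```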
*Thread Reply:* I am also looking into LogicalRDDVisitor
*Thread Reply:* to check if this should fit that visitor
*Thread Reply:* Again, it could. The thing is that visitor looks for files:
```
@Override
public List<D> apply(LogicalPlan x) {
  LogicalRDD logicalRdd = (LogicalRDD) x;
  List<RDD<?>> fileRdds = Rdds.findFileLikeRdds(logicalRdd.rdd());
  return findInputDatasets(fileRdds, logicalRdd.schema());
}
```
Though, that isn't to say we can't change it.
*Thread Reply:* It looks really hacky but perhaps this is the only way to go with this. Before going this way, I would check if there is no Spark action before that creates this LogicalRDD. If so, we could glue the two logical plans, which would be better in my opinion.
*Thread Reply:* The approach I was thinking of was: LogicalRddInputDatasetBuilder
*Thread Reply:* Because LogicalRDDs are always leaves.
*Thread Reply:* but something needs to create the rdd first
*Thread Reply:* Aye - the thing is, when it's foreachBatch, we don't know what it is, in this case.
*Thread Reply:* We can't see that it came from "foreachBatch"
*Thread Reply:* Isn't it a Spark action that is run within another Spark action? (sry if asking silly questions)
*Thread Reply:* I guess you could see it that way, the problem is spark breaks the logical lineage
*Thread Reply:* but retains the physical lineage
Hi - I'm trying to build a custom extractor and running into an issue:
Broken plugin: [airflow.providers.openlineage.plugins.openlineage] 'NoneType' object has no attribute 'get_operator_classnames'
I can't tell if I'm placing my extractor in the correct location. I've placed it in plugins/custom_extractor/CustomExtractor.py
I've set this env variable AIRFLOW__OPENLINEAGE__EXTRACTORS: 'custom_extractor.CustomExtractor.PythonOLExtractor'
Any ideas here? I can't figure it out from the documentation
*Thread Reply:* I think the issue is that you have some circular dependency somewhere, that breaks the import - that's usually how it's manifested. Other than that it would help if you could share the code
*Thread Reply:* can you check if you have the same issue if you do not import anything from Airflow or OpenLineage in your custom extractor PythonOLExtractor?
*Thread Reply:* @Maciej Obuchowski I believe when I get rid of these imports, things start to work. Can't confirm because then I have a break where Dataset is not defined, etc.
from openlineage.airflow.extractors import TaskMetadata
from openlineage.airflow.extractors.base import BaseExtractor
from openlineage.client.run import Dataset
*Thread Reply:* Do I have to get rid of these imports and manually build the dataset
*Thread Reply:* @Tom Linton try to move the imports locally, into extract_on_complete - and leave them behind typing.TYPE_CHECKING at the top level
*Thread Reply:* something like that
```
from typing import Optional, List, TYPE_CHECKING

if TYPE_CHECKING:
    from openlineage.airflow.extractors import TaskMetadata
    from openlineage.airflow.extractors.base import BaseExtractor
    from openlineage.client.run import Dataset
    from customop.CustomPythonOperator import InsAndOuts

def create_dataset(datasets: List[InsAndOuts]) -> List[Dataset]:
    from openlineage.client.run import Dataset
    return [Dataset(namespace=item.connection,
                    name="{}.{}.{}".format(item.db, item.schema, item.table),
                    facets={}
                    ) for item in datasets]

class PythonOLExtractor(BaseExtractor):
    @classmethod
    def get_operator_classnames(cls) -> List[str]:
        return ['CustomPythonOperator']

    def extract(self) -> Optional[TaskMetadata]:
        pass

    def extract_on_complete(self, task_instance) -> Optional[TaskMetadata]:
        from openlineage.airflow.extractors import TaskMetadata
        task = task_instance.task
        inputs = self.operator.get('input_data')
        output = self.operator.get('output_data')
        return TaskMetadata(
            name=task.task_id,
            inputs=create_dataset(inputs),
            outputs=create_dataset(output)
        )
```
*Thread Reply:* can you paste current version or is that it? ⬆️
*Thread Reply:* Current version of my code was just a copy and paste of your message
*Thread Reply:* ah, okay, that was quick example in the notepad
*Thread Reply:* ```
def create_dataset(datasets):
    from openlineage.client.run import Dataset
    return [Dataset(namespace=item.connection,
                    name="{}.{}.{}".format(item.db, item.schema, item.table),
                    facets={}
                    ) for item in datasets]

class PythonOLExtractor:
    def __init__(self, operator):
        super().__init__()
        self.operator = operator

    @classmethod
    def get_operator_classnames(cls):
        return ['CustomPythonOperator']

    def extract(self):
        pass

    def extract_on_complete(self, task_instance):
        from openlineage.airflow.extractors import TaskMetadata
        task = task_instance.task
        inputs = self.operator.get('input_data')
        output = self.operator.get('output_data')
        return TaskMetadata(
            name=task.task_id,
            inputs=create_dataset(inputs),
            outputs=create_dataset(output)
        )
```
maybe try this 🙂
*Thread Reply:* Errors gone, but the operation is failing due to no module names openlineage.airflow
*Thread Reply:* do I need to pip install openlineage?
*Thread Reply:* What version of airflow you're on? The answer depends
*Thread Reply:* Then you need to import from apache.airflow.providers.openlineage
*Thread Reply:* Not openlineage.airflow - and install apache-airflow-providers-openlineage
*Thread Reply:* I'm dying a slow death here 🫠
Now TaskMetadata is not part of airflow.providers.openlineage
*Thread Reply:* yeah, we renamed it OperatorLineage 🙂
https://github.com/apache/airflow/blob/9c4e333f5b7cc6f950f6791500ecd4bad41ba2f9/airflow/providers/openlineage/extractors/base.py#L34
> I'm dying a slow death here 🫠
good example why not to give advice from the phone, you rarely give one without bugs
*Thread Reply:* from airflow.providers.openlineage.extractors import OperatorLineage
*Thread Reply:* I really appreciate all the help you've given me!
*Thread Reply:* Hello, did you manage to get it working? I tried all the things above but still getting Broken plugin: [airflow.providers.openlineage.plugins.openlineage] 'NoneType' object has no attribute 'get_operator_classnames'
😞
*Thread Reply:* ```from typing import List, TYPE_CHECKING
if TYPE_CHECKING: from openlineage.airflow.extractors.base import BaseExtractor
class DBTExtractor(BaseExtractor): @classmethod def getoperatorclassnames(cls) -> List[str]: return ["DbtRunKubernetesOperator"]
def _execute_extraction(self):
from airflow.providers.openlineage.extractors import OperatorLineage
from openlineage.client.run import Dataset
input_dataset = Dataset(
namespace="Extractors",
name=f"{self.operator}_Extractors_in",
)
output_dataset = Dataset(
namespace="Extractors",
name=f"{self.operator}_Extractors_out",
)
return OperatorLineage(
inputs=[input_dataset],
outputs=[output_dataset],
)```
*Thread Reply:* If you use the BaseExtractor at top level, you can gate the import behind TYPE_CHECKING
*Thread Reply:* hello, as you can see, already am 🙂
*Thread Reply:* Ah, I wrote the opposite of what I meant. I mean you can't do that, because you're actually using this in the class definition 🙂
@Damien Hawes are you using intellij? If yes, have you solved the issue with intellij not recognizing dependencies between projects after each Gradle refresh? I can add them manually, but it gets dropped when intellij picks dependencies from Gradle again. I think it's because intellij does not respect the particular configuration we're using for dependencies now
*Thread Reply:* I filed an issue with JetBrains a long time ago.
*Thread Reply:* Ultimately, it's IntelliJ's module system not playing nicely with Gradle
*Thread Reply:* It's an annoying thing, that's for sure.
*Thread Reply:* I see, so just like my other issue: https://youtrack.jetbrains.com/issue/IDEA-140707 🙂
If you're using 1.12.0, how are things going? Curious if you're seeing improvements... (cc @Maciej Obuchowski)
*Thread Reply:* they keep making thin wrappers around OpenLineage
*Thread Reply:* ```
import static io.openlineage.spark.agent.util.ScalaConversionUtils.asJavaOptional;

import io.openlineage.client.Environment;
import io.openlineage.client.OpenLineage;
import io.openlineage.spark.agent.ArgumentParser;
import io.openlineage.spark.agent.EventEmitter;
import io.openlineage.spark.agent.JobMetricsHolder;
import io.openlineage.spark.agent.Versions;
import io.openlineage.spark.agent.lifecycle.ContextFactory;
import io.openlineage.spark.agent.lifecycle.ExecutionContext;
import io.openlineage.spark.agent.util.ScalaConversionUtils;
```
*Thread Reply:* ```
/**
 * This code has been referenced from
 * https://github.com/Natural-Intelligence/openLineage-openMetadata-transporter.git
 */
```
*Thread Reply:* I think the repo needs a NOTICE file
*Thread Reply:* They are literally the second datahub 🙂 Why contribute to an open-source project if you can fork it
@Julien Le Dem @Michael Robinson @tati should we keep some initial agenda for Tuesday meetings, maybe just lightweight as a slack thread?
We can try here: for the next Tuesday 23rd April
*Thread Reply:* First point, discuss moving docs to the monorepo - the gains would be:
• having docs in the same PR as the feature/bugfix
• @Michael Robinson would not need to copy the changelog
• we can have doc versioning: the release doc version would be tied to the tag for the particular release
*Thread Reply:* For Marquez we did this for the landing pages and docs, and while we haven't taken advantage of the ability to implement versioning yet, there haven't been any problems that I'm aware of.
*Thread Reply:* Another point for tomorrow: • discussing maintaining milestones and assignees for OL tickets to increase visibility of what people work on or plan to work on. We mentioned it last week, but @Harel Shein and @tati were missing or not able to talk.
*Thread Reply:* View vs table: • When reading from a view or materialized view, should we reference the view or the underlying table (tables?) that the view is based on • https://github.com/trinodb/trino/pull/21265#discussion_r1576169364
@Jakub Dardziński thanks for massive update to generated.schema in Python, I've started to beta test it but found one issue: https://github.com/OpenLineage/OpenLineage/issues/2629
*Thread Reply:* I don’t have a strong opinion on this. If it would enable bi-directional operations on serialization then probably we should do this. The only reason _schemaURL is not a ClassVar is to include it in __attrs_attrs__
*Thread Reply:* btw, what do you use to deserialize JSONs back to classes? cattrs or something else?
*Thread Reply:* cc @Maciej Obuchowski
*Thread Reply:* we can use cattrs converters - cattr.Converter().structure(_data, RunEvent) - if we can do sth about type hints that look like str | None that currently break this approach
https://github.com/OpenLineage/OpenLineage/pull/2632 fixes the main build
Hi, Do we officially support datahub as a consumer for OpenLineage? Do we have any docs on the integration?
Datahub added OL compatible API endpoint just a few weeks ago, in v0.13.1 https://datahubproject.io/docs/releases/#whats-changed
*Thread Reply:* https://openlineage.slack.com/archives/C065PQ4TL8K/p1708517956104609 🙂
Hi channel! I have posted this today on #general but I think this may be a more appropriate place: https://openlineage.slack.com/archives/C01CK9T7HKR/p1713866218368999
Basically the issue that we're facing is that we may run the same dbt project from a multistep Airflow DAG - but with different model selectors or configurations.
With the current dbt-ol logic, the dbt wrapper job name always ends up in the format dbt-run-<dbt_project_name>. That's a problem if you share the same dbt project across multiple tasks in the same DAG - maybe using different models, which can point to different dependencies, so you may want those things to be captured in the graph.
My proposal (/workaround) would be to enrich job_name in dbt-ol using the format dbt-run-$OPENLINEAGE_PARENT_ID-<dbt_project_name>, if OPENLINEAGE_PARENT_ID is set. So different Airflow tasks that launch the same project won't have name clashes on the job name.
The other idea that crossed my mind (which is probably more general-purpose) would be to have a generic OPENLINEAGE_NAME_TEMPLATE variable, which would allow us to specify something in the format e.g. dbt-run-${OPENLINEAGE_PARENT_ID}.${JOB_NAME}, without trying to come up with a one-size-fits-all solution. What are your thoughts?
@Maciej Obuchowski after our sync, can you help me merge this in? I might not have the right permissions on that repo
*Thread Reply:* I don't too 😞
Merging is blocked
The base branch does not allow updates. Learn more about protected branches.
*Thread Reply:* only the main OL repo
*Thread Reply:* @Julien Le Dem you can make me maintainer of this repo as well 🙂
*Thread Reply:* The committers group is maintainer as well
*Thread Reply:* hmm.. I can't change base branch protections or override them
*Thread Reply:* only admins can bypass branch protections. Not maintainers
*Thread Reply:* maybe it's just that a pr is not the right way to do this?
*Thread Reply:* I used the fork syncing feature and that created this PR
*Thread Reply:* ah, I see. I'm looking at that doc
*Thread Reply:* What are the 18 commits ahead?
*Thread Reply:* I see, I think only Pawel merged PRs before
*Thread Reply:* those are the commits to add a notice to the readme and merge update PRs
*Thread Reply:* The way branch protection is set up, I think you need an admin to merge every time, because it's not just the branch from upstream.
*Thread Reply:* I think, the way we handle the fork, we should uncheck that read-only check box
*Thread Reply:* yes, let's uncheck this 🙂
The 1.13.0 changelog PR is RFR if anyone has a moment: https://github.com/OpenLineage/OpenLineage/pull/2638
Hi folks, was there any discussion regarding the beam integration since this issue that I missed? I'm working on beam/dataflow lineage for my team and would appreciate some help brainstorming the idea.
*Thread Reply:* I don't think so, but as a former Beam user - I'd be down to help!
*Thread Reply:* @Dominik Dębowczyk might have some experience with Beam as well
Hello, I am trying to use the OL integration with Airflow and Marquez. I am trying to graph workflow lineage within Marquez by using the ParentRunFacet to obtain upstream and downstream task dependencies however, I am not clear on how to send this information to Marquez so that the task hierarchy is visualized in the UI.
*Thread Reply:* Hey Catherine, please post the question in Marquez Slack, there are probably more people that could help you 🙂
Hi, today we've been discussing with the team about the dbt job name disambiguity / templating options (context: https://openlineage.slack.com/archives/C01CK9T7HKR/p1713866218368999).
The job name display / templating path would probably fix the same-dbt-project-different-model issue, but (unlike the job name prefix/suffix intermediate solutions) it also resonates quite well with another issue that we're currently trying to solve (i.e. job name explosion / intelligibility), which @Maciej Obuchowski briefly mentioned on that thread.
Shall we organize a meeting to discuss these options, as it may probably be a quicker way to reach consensus? Or shall I bring it up to tomorrow's committer sync general discussion?
(cc @Arnab Bhattacharyya)
I have a conflict with the committer sync tomorrow, but wanted to raise something that someone asked me about today. What do folks here think about emitting OL events as OpenTelemetry traces? If so, where would you implement that? Motivation is that there are a bunch of tools built around opentelemetry that can leverage this integration and get some immediate value.
*Thread Reply:* I’m not sure about it myself, so I wanted to get more opinions as well
*Thread Reply:* I have an item, too, if I can hijack this thread 🙂. I'd like to propose automating the selection process for the newsletter's contributor of the month and shifting the focus to a PR rather than a contributor, perhaps using Airflow's PR of the month script.
*Thread Reply:* feedback here: we would love to be able to use lineage as otel traces so we can mix lineage with tracing. this can help in a pipeline that starts with spark/flink and continues into output processing in a different pipeline which is more kafka and backend based
*Thread Reply:* Hey @Iftach Schonbaum if you have a specific use case in mind and are willing to work on this, happy to collaborate!
Hi, as discussed during today's committer sync, we're sunsetting the Contributor of the Month feature of the newsletter for the time being and adding an automated PR of the Month feature in its place. Airflow uses a script for this use case that IMO does a pretty good job selecting candidates (we could also create our own script, of course, or fork Airflow's and tweak it if someone wants to take that on). Applied to OpenLineage, the script identifies the following as the top five PRs for the month: Top 5 out of 68 PRs:
Next month I'll give folks more time to vote, but if you have time by 3 PM ET tomorrow would you please vote in the thread? Absent votes we'll just go with #1 unless there are objections. 🙂
*Thread Reply:* how about reactions like when release voting?
*Thread Reply:* sure, works for me
*Thread Reply:* PR #2520: python/client: generate Python facets from JSON schemas. Score: 199.211
*Thread Reply:* PR #2510: sqlparser: update code to conform to upstream sqlparser-rs changes. Score: 46.9
*Thread Reply:* PR #2548: [SPEC] Allow nested struct fields in SchemaDatasetFacet. Score: 40.5
*Thread Reply:* PR #2572: [CLIENT] Fix missing pkg_resources module on Python 3.12. Score: 35.0
*Thread Reply:* PR #2578: [Airflow] Fixed format returned by airflow.macros.lineage_parent_id. Score: 33.6
*Thread Reply:* I’m voting for 2548. @dolfinus went beyond his main language and contributed to the Java code generator (which wasn’t easy)
*Thread Reply:* It's a tie! Thanks for voting
Hi all, starting a thread to collect agenda items for this month's TSC meeting, which is next Wednesday at 9:30 am PT. Please reply here with your items.
*Thread Reply:* Release: 1.13.1
*Thread Reply:* Events: Open Standards for Data Lineage panel
*Thread Reply:* issue templates, label updates in GH (Kacper)
*Thread Reply:* protobuf update (Pawel)
Hi folks, I've put together a couple of ideas for the job name override problem (specifically taking the Airflow+dbt case as an example but it can have wider applications): https://paste.manganiello.tech/?d28179d0c3f3cb07#9TWwtxmsMZ2o7c3EgpxEKQLgQGbSb4QLNdjK9krYG56w. Let me know your thoughts and whether I should publish it as a Github discussion or through other means (e.g. Google doc) (cc @Arnab Bhattacharyya)
I've got a question about the difference between io.openlineage.client and io.openlineage.server - the server classes lack the structure of the client classes. For example, if we look at RunFacets, the client definition contains:
ProcessingEngineRunFacet
NominalTimeRunFacet
etc.
The server version doesn't. This has an implication when you deserialise the events as the server version. Specifically, all information in the facets gets shunted into the additionalProperties field. So I'm just curious as to what is the purpose of the server classes?
*Thread Reply:* the server classes are meant for receiving OL events. Typically a server will accept any facet, including future versions of standard facets. So it deserializes the facets in a generic way. As opposed to the client, where the purpose is to generate facets against the current version of the schema.
*Thread Reply:* Aye - I understand that. It's more around the implication. The server models lose a lot of type information. Compare and contrast the following, using this snippet of code:
```
var serverEvent =
    readResource(
        "test-events.ndjson", io.openlineage.server.OpenLineage.RunEvent.class).get(0);

var clientEvent =
    readResource(
        "test-events.ndjson",
        io.openlineage.client.OpenLineage.RunEvent.class).get(0);
```
The first screenshot shows the structure of the server event and the second screenshot shows the structure of the client event.
*Thread Reply:* Important thing to note: in the server event, all facets are located in the additionalProperties of the facets object, and all properties of any individual facet are located within its own additionalProperties property.
*Thread Reply:* Whilst the client model retains that typing information.
*Thread Reply:* This does imply that a (Java) server deserialising these models and using the facet information will have to perform a lot of casts.
*Thread Reply:* Yes, this is true. I've been thinking of a better way to do this. Possibly something like this:
```
OutputDatasetFacet outputStatistics =
    readServer.getOutputs().get(0).getOutputFacets().getAdditionalProperties().get("outputStatistics");
OutputStatisticsOutputDatasetFacet translated =
    mapper.convertValue(mapper.valueToTree(outputStatistics), OutputStatisticsOutputDatasetFacet.class);
```
*Thread Reply:* If this is a satisfying solution, we could add it to the server model. Something like: > getOutputFacets().getFacet("outputStatistics", OutputStatisticsOutputDatasetFacet.class)
*Thread Reply:* The server model has not been used much so far and can be improved for sure.
*Thread Reply:* another option is to have all the same get{name}Facet methods as the client model but deserialize from the Map when it happens. It might help if we make the additional properties values JsonNode and not just Object.
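A minimal sketch of that accessor idea, assuming Jackson for the re-binding - the method name and placement are illustrative, not an existing API:
```
// pull a facet out of the generic additionalProperties map and re-bind it
// to a concrete client facet class via Jackson
<T> Optional<T> getFacet(Map<String, Object> additionalProperties, String name, Class<T> type) {
  ObjectMapper mapper = new ObjectMapper();
  return Optional.ofNullable(additionalProperties.get(name))
      .map(raw -> mapper.convertValue(raw, type));
}
```
Usage would then read like the proposal above: getFacet(facets.getAdditionalProperties(), "outputStatistics", OutputStatisticsOutputDatasetFacet.class).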
I've opened a PR for the dbt job name duplication issue: https://github.com/OpenLineage/OpenLineage/pull/2658. It doesn't implement the general-purpose name templating solutions I've discussed in my proposal; instead it just adds the dbt profile and model(s) to the name if available. It also adds a new environment variable to control the new behaviour (OPENLINEAGE_DBT_USE_EXTENDED_JOB_NAME) in order to minimize the risk of breaking back-compatibility - i.e. the same workflow run before and after the upgrade may generate different job names. Wondering if this is actually required or if we're ok with having the new syntax rolled out to everyone.
This is my proposal on how we should update labels on GitHub so that they are more useful. I don't need help with the operation itself, but would like some feedback before I start 🙂. I will leave it here for a while; if you think it's a good idea, leave a ➕, else - leave a comment in the doc.
https://docs.google.com/spreadsheets/d/1WYt0IHc2N5pWhMSfN7vlFaXgssV_pdAlL0UxAo5yblE/edit?usp=sharing
I have a conflict for today's meeting and won't be able to join
I have a question about the environment-properties facet. It contains different variables for different environment types, e.g. spark.databricks.clusterUsageTags.clusterName for Databricks, or spark.cluster.name for GCP, but it also can contain OS environment variables like SOME_ENV. Shouldn't this be split into different facets, e.g. spark_databricksInfo, spark_gcpInfo, spark_envVars?
We need to get a release out today, according to our policy. @Michael Robinson will you be able to do it?
*Thread Reply:* FYI: https://github.com/OpenLineage/OpenLineage/pull/2683
*Thread Reply:* I would not guess it, if whole Snowflake syntax was added
Hey all, 2 things from a catch up with @Jens Pfau & @Natalia Gorchakova today (feel free to add more details if I'm missing anything):
*Thread Reply:* 1. The current proposal is to put it under OpenLineage/spec/registry in the same repo and use a CODEOWNERS file to delegate access. Would that work?
*Thread Reply:* 1. I think that works too, as long as we have the codeowners file.
*Thread Reply:* 2. I remember this discussion now. Do we want to capture this as a facet? It does have semantics close to the column lineage facet, only at the table level.
*Thread Reply:* we can create TableDependenciesDatasetFacet? It could also exist, if we want to have duplication with the CLL facet
*Thread Reply:* and we don't always attach the CLL facet, but if we go with an additional facet we'd need to always attach TableDependenciesDatasetFacet - we don't want OL consumers to guess whether they need to use one facet or the other depending on one's existence
*Thread Reply:* I think OpenLineage/spec/registry works. @Natalia Gorchakova to chime in if you disagree.
*Thread Reply:* If we add something like a TableDependenciesDatasetFacet, the consumer would need to reconcile information of inputs and outputs with the content of this facet. Might not be ideal?
But I don't have a better idea either 😞
*Thread Reply:* OpenLineage/spec/registry it is! I'll start a PR with the folder structure, but we need to get the proposal merged in first.
a few comments to address there still
@Julien Le Dem if you have some time?
*Thread Reply:* @Harel Shein should we also define the name under which the facet should be reported? i.e. https://github.com/OpenLineage/OpenLineage/blob/e88aaa147edfebd7d1f399b554b846d951[…]ntegration/spark/shared/facets/spark/v1/logical-plan-facet.json is reported under name spark.logicalPlan
*Thread Reply:* for example, we can: • for each facet definition define the name that should be used to report it • force the naming convention (at least the facet name should start from top level name)
*Thread Reply:* @Harel Shein @Julien Le Dem I tried to map the proposed facet registry schema to GCP needs. And that's what I got:
• name "gcp" with GcpJobFacet / GcpRunFacet as consumer
• name "gcp_dataproc" with GcpDataprocJobFacet / GcpDataprocRunFacet as producer
• name "gcp_SYSTEM" with GcpSYSTEMJobFacet / GcpSYSTEMRunFacet as producer
Is that the way you had in mind?
*Thread Reply:* so, something like this for the registry.json?
```
{
  "producer": {
    "root_doc_url": "https://google_dataproc_docs",
    "produced_facets": [
      "ol:gcp_dataproc:GcpDataprocJobFacet.json",
      "ol:gcp_dataproc:GcpDataprocRunFacet.json",
      ....
    ]
  },
  "consumer": {
    "root_doc_url": "https://google_dataplex_docs",
    "consumed_facets": [
      "ol:gcp_dataplex:GcpJobFacet.json",
      "ol:gcp_dataplex:GcpRunFacet.json",
      ....
    ]
  }
}
```
*Thread Reply:* Sorry for the late reply. This was the week I was travelling. That sounds good to me. I commented on the last piece of this in your PR @Natalia Gorchakova
Should we drop Airflow 2.1? It's not supported by any cloud provider anymore.
*Thread Reply:* Maybe we should go for both 2.1 and 2.2 at the same time, since both lack the listener API and have very limited OL functionality?
I can't attend today's committer sync because of a conflict with a team offsite. See you next week!
👀 Next week's Marquez community meeting will feature a demo of a new data quality feature in the UI. Get the meeting link here: http://bit.ly/MarquezMeet
OpenLineage should be the top reply here 😉 https://www.reddit.com/r/dataengineering/comments/1cvmerf/data_lineage_tools/
*Thread Reply:* I put my 🆙 let's be at the top! 🙂
*Thread Reply:* • How do we deprecate non-standard facets ◦ When we have 1-1 replacement ◦ When we don't?
*Thread Reply:* Related:
Does somebody know if these talks/blog posts are available somewhere else so that we can repair the redirection? https://github.com/OpenLineage/docs/pull/325/files I think it would be good to leave them there even without the URLs, just to show that there are people from OL actively participating in events. WDYT?
*Thread Reply:* The cross-platform one: https://youtu.be/rO3BPqUtWrI?si=Yu7oenQY2RhkERhx
*Thread Reply:* We could link to the 2021 Berlin Buzzwords slides if there's not a video: https://2021.berlinbuzzwords.de/sites/berlinbuzzwords.de/files/2021-06/Data%20Pipelines%20Observability%20with%20OpenLineage.pdf
*Thread Reply:* Agree with your idea to keep items even if we can't link out to recordings, etc. When we redesign this, we could make it less obvious whether or not there are links...
Does anyone remember a discussion of adding a contributors doc to the repo? I fear that I volunteered to work on this but only vaguely remember the details. So I'm thinking of revising this to list organizations and companies rather than individuals. If you remember the discussion or have opinions about this, would you please let me know?
*Thread Reply:* I opened a PR to rewrite the doc as a table for acknowledging orgs.
Is anyone available to give this a review? It changes around 100 files across the project (to update the copyright year): https://github.com/OpenLineage/OpenLineage/pull/2712
As discussed multiple times, I think the time has come to drop Airflow 2.1 and 2.2 support https://github.com/OpenLineage/OpenLineage/pull/2710. Feel free to leave any thoughts in the PR 🙂
Friendly reminder: this month's Marquez meeting is tomorrow https://openlineage.slack.com/archives/C065PQ4TL8K/p1715899030556759
Anyone know of any reason we can't merge release commits via PR rather than pushing to main, which evidently is no longer permitted? I can't think of a reason why it wouldn't work but would prefer to avoid a mess if possible. @Julien Le Dem @Maciej Obuchowski @tati @Harel Shein
*Thread Reply:* hmm, as the release dev, you should be able to apply commits to main (or at least within the window of the release)
*Thread Reply:* yeah, I wonder if some settings have been changed?
*Thread Reply:* this might be related to what we discussed about doing for the registry?
*Thread Reply:* a few weeks ago we change that setting in the weekly meeting. I believe Maciej missed that one. I put it back. Maciej and Willy are admins and can tweak the branch protection rules as needed.
*Thread Reply:* The mvn release plugin does push to main as it updates the current version to {next release}-snapshot
To clarify my post above: releases are currently blocked because it seems that a new branch protection is keeping me from pushing to main as our current release process requires. To unblock releases, it seems to me we need either:
• a new role for the release dev with the necessary privileges for bypassing the protection
• a new process (e.g., a series of PRs for the commits generated by the script)
I'm not suggesting we remove the protection. But I do think we need an agreed-upon fix to either the process or the permissions in GH. In my haste to get the release out I opened a PR (since closed). Merging that particular PR wouldn't have unblocked the release because it had both the dev version snapshot commit and the release version commit. Regardless, IMO, a hasty workaround isn't the correct way to fix this.
I have a conflict at 9am and will miss the second half of the meeting tomorrow.
Agenda for today's committer meeting: • Improvements for the release process
*Thread Reply:* we can also follow up on the registry
*Thread Reply:* sorry I missed the sync today, did you get to talk about the registry?
*Thread Reply:* @Harel Shein yes, briefly 😉
*Thread Reply:* Sorry, I missed the committer meeting. Is there a meeting calendar I can subscribe to?
*Thread Reply:* added you to the distro, lmk if you don't see it
It's time to vote for this month's PR of the Month! Applied to OpenLineage, Airflow's script for this use case identifies the following as the top five PRs for the month:
Top 5 out of 53 PRs:
> * PR #2228: [PROPOSAL #2161] Add a Registry of Producers and Consumers in OpenLineage.
> * PR #2658: Support a less ambiguous logic to generate job names.
> * PR #2677: Spark: Add facets to Spark Application events.
> * PR #2719: Spark: Add jobType facet to Spark application events.
> * PR #2693: python: use v2 Python facets.
Would you please vote in the thread by 3 PM ET on Friday? Absent votes I'll go with #1.
*Thread Reply:* Same with my vote. Registry is cool.
*Thread Reply:* plus one for the registry
@Mariusz Górski congrats on getting this PR https://github.com/trinodb/trino/pull/21265 merged!
*Thread Reply:* Oooooh. That'd be a great blog post on the OL blog!
*Thread Reply:* We wrapped up early today, FYI
*Thread Reply:* 😞 time management for this call has been challenging with my new role. does this time of day still make sense for everyone?
*Thread Reply:* would you prefer earlier or later hours?
*Thread Reply:* doesn't matter to me tbh, but I realize it's a late meeting in the day for y'all.
*Thread Reply:* Maybe we can move it earlier in the day for the next few weeks to make it easier for EU folks?
Hi all, here's this month's thread for the TSC meeting agenda (the meeting is next Wednesday). Please reply with any agenda items.
*Thread Reply:* A slide deck draft for the meeting: https://docs.google.com/presentation/d/1T04oYaZAhmxTzzJ7WVup118kw29IJDSs/edit?usp=sharing&ouid=116057523906319252244&rtpof=true&sd=true @Harel Shein
Hey all, I'll be facilitating the monthly TSC meeting tomorrow, would love to get agenda items for the above ^
*Thread Reply:* • ~can we merge this~ https://github.com/OpenLineage/OpenLineage/pull/2740 ?
*Thread Reply:* Wednesday, right?
*Thread Reply:* I can talk about what's new in Airflow integration
@Paweł Leszczyński @Maciej Obuchowski - can you guys give your opinion on this one: https://github.com/OpenLineage/OpenLineage/pull/2650
(Also, how do I get access to approve the run of integration tests from Circle CI?)
*Thread Reply:* I think that you simply have to log in to CircleCI with your GH account, it's synced with OL organisation on GH.
*Thread Reply:* yeah, anyone with push access to repo should be able to approve
Do we have topics for today's committer meeting? I don't have anything now tbh
*Thread Reply:* mostly to finalize the slides for wednesday
*Thread Reply:* all the data catalogs out there just got some competition
*Thread Reply:* coincidentally, databricks is submitting a project for approval at the LFAI meeting today...
*Thread Reply:* integration with OL might become easier
*Thread Reply:* should we create first issue to accept OpenLineage when they actually open the github repo? 🙂
*Thread Reply:* it should happen today
*Thread Reply:* https://github.com/unitycatalog/unitycatalog/issues
*Thread Reply:* https://www.unitycatalog.io/
*Thread Reply:* It supports the Iceberg catalog API too. Another competitor to the Polaris catalog?
If anyone is interested in how to get OpenLineage from AWS Glue using the Spark integration, I can share a link to my company’s beta instructions so you can see how we are handling it.
Also, if anyone knows of a good place for me to figure out which Spark functions are handled, that would be amazing.
*Thread Reply:* Definitely worth taking a look 🙂
> which Spark functions are handled
What do you mean exactly? Handling of the different logical plan types is spread throughout the code
*Thread Reply:* Ah, I assumed it was handled per function… it’s handled by logical plan type. Customers always want to know whether their jobs will be supported and whether they’ll get lineage from them, and it helps to have some kind of guideline (even if it’s “if the Spark engine turns it into an X type of logical plan”).
*Thread Reply:* It also helps us figure out what we might develop for a PR - would an unsupported lineage feature be part of an existing logical plan type or a new one, etc
*Thread Reply:* > Customers always want to know whether their jobs will be supported and whether they’ll get lineage from them, and it helps to have some kind of guideline (even if it’s “if the Spark engine turns it into an X type of logical plan”)
Sometimes the problem is that it works for version X of a connector (or, even worse, only for some version of connector X in combination with Spark version Y)
> It also helps us figure out what we might develop for a PR - would an unsupported lineage feature be part of an existing logical plan type or a new one, etc
Yeah - Spark interfaces are a mess, and connectors do the same thing in multiple different ways. For example, Hive adds its own LogicalPlan nodes, some connectors use DataSourceV1 interfaces (RelationProvider etc.) and some DataSourceV2 ones, which have known LogicalPlan nodes but implement different interfaces underneath.
Overall, the best solution would be implementation of the Java interfaces by the connector itself, which we're currently iterating on based on feedback from connector authors: https://github.com/OpenLineage/OpenLineage/pull/2675
*Thread Reply:* @Sheeri Cabral (Collibra) I'm curious - have you managed to configure OL in Glue from the job itself, or just using the `Job Details` page in the AWS console?
*Thread Reply:* I would be interested too 👀
*Thread Reply:* We have worked with customers to put it in the job itself, as per https://openlineage.io/docs/integrations/spark/configuration/usage/
*Thread Reply:* (the “directly in your application” part, and so far we’ve only tested on Python)
*Thread Reply:* Here’s an example…
```
import sys
from pytz import timezone
from datetime import datetime
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, col, when

spark = SparkSession\
    .builder\
    .appName('openlineage')\
    .config("spark.driver.extraJavaOptions", "-Dlog4j.configuration=log4j.properties")\
    .config("spark.executor.extraJavaOptions", "-Dlog4j.configuration=log4j.properties")\
    .config('spark.sql.legacy.parquet.int96RebaseModeInRead', 'CORRECTED')\
    .config('spark.sql.legacy.parquet.int96RebaseModeInWrite', 'CORRECTED')\
    .config('spark.sql.legacy.parquet.datetimeRebaseModeInRead', 'CORRECTED')\
    .config('spark.sql.legacy.parquet.datetimeRebaseModeInWrite', 'CORRECTED')\
    .getOrCreate()

# The S3 URIs were elided in the original message; these are placeholders.
file_path = "s3://<bucket>/<input-path>"
static_value_fields_path = "s3://<bucket>/<static-value-fields-path>"

file_df = spark.read.format("parquet").load(file_path)
static_value_fields_df = spark.read.format("parquet").load(static_value_fields_path)

# Left join on regulator, then add a literal column and a derived column
join_df = file_df.alias("a")\
    .join(static_value_fields_df.alias("b"), col("a.regulator") == col("b.regulator"), "left")\
    .withColumn("test_data", lit(10))\
    .withColumn("derived_column", when(col('b.foo').isNull(), lit('No foo found')).otherwise(col('b.bar')))

join_a_df = join_df.select("a.*", "test_data", "derived_column")

print("first dataframe")
join_a_df.write.format('parquet').mode('overwrite').save("s3://<bucket>/<output-path>")  # placeholder output path
print("end of the job")
```
*Thread Reply:* (You do need to attach the jar to the job, which is on the job details page I think?)
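*Thread Reply:* For context, here is a minimal sketch of what the “directly in your application” configuration from the docs above could look like in that same builder. The listener class and `spark.openlineage.*` keys come from the OpenLineage Spark integration docs; the transport URL and namespace values are placeholders, and this assumes the OpenLineage Spark agent jar is already attached to the job:
```
from pyspark.sql import SparkSession

# Sketch: OpenLineage settings added to the SparkSession builder.
# The URL and namespace below are placeholder values, not our actual setup.
spark = SparkSession\
    .builder\
    .appName('openlineage')\
    .config('spark.extraListeners', 'io.openlineage.spark.agent.OpenLineageSparkListener')\
    .config('spark.openlineage.transport.type', 'http')\
    .config('spark.openlineage.transport.url', 'http://localhost:5050')\
    .config('spark.openlineage.namespace', 'glue-jobs')\
    .getOrCreate()
```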
Delta Lake is also part of the LF AI & Data as of a few weeks ago. It will be interesting to discuss integrating the new OL interfaces for Spark datasources in there. It should be easier now that both are part of the same foundation.
*Thread Reply:* @Dominik Dębowczyk is working on the next version of the OL interfaces - we missed a pretty huge "dependency hell" issue with the current ones
Any topic for today's meeting?
~What do you think about making it a requirement that non-committer PRs be approved by two committers? We're thinking of a situation where a committer creates their own PR from a second account and approves it themselves.~ Let's discard this proposal.
*Thread Reply:* Sounds great. Imo any change improving security is very welcome, and this one should not disrupt development in any way, as we have committers who are active daily and will be able to approve a PR without additional delay.
*Thread Reply:* Hmmm… I’m not sure I follow what we’re trying to protect against here. If any committer did that, we would revert the PR and revoke that committer’s privileges immediately, as it does not follow our guidelines.
*Thread Reply:* This approach makes sense if we assume we will detect any unwanted merged change right away. It may be problematic if we overlook a change that introduces a backdoor (we'd have to overlook it again when releasing) or one that uses CI for heavy computations (that would be a problem right away).
*Thread Reply:* I have looked and I don't think any relevant OSS project does what I suggested, so I guess it's not a big deal.
Should we do a release soon? It would be nice to do it before CircleCI removes the Mac executors, to potentially buy us more time before the next release.
*Thread Reply:* > before CircleCI removes the Mac executors
Too late :(
*Thread Reply:* Not really, they just disabled it for today
*Thread Reply:* The problem is they said they would enable the ARM ones in May, before removing the Intel ones
*Thread Reply:* And they haven't done it yet
who's joining the committer sync today? I'm guessing we will take at most 30 minutes since Poland needs to start losing to France at 6pm ⚽
*Thread Reply:* picking up my daughter from camp but I'll try to listen in from the car. good to know I need to be rooting for France
*Thread Reply:* lol
*Thread Reply:* I think both Maciej and Jakub are OOO
*Thread Reply:* Yes, Maciej and Jakub are OOO, I'm also not joining today
*Thread Reply:* Poland did NOT lose to France today (1:1).
The greatest achievement of the Polish team at the Euros happened during Euro 2004. Although the Polish team did not qualify, a few weeks before the Euro we won a friendly match against Greece, which went on to win the whole of Euro 2004. Beating a champion is logically equivalent to being a champion (a kind of transitive relation).
Today there was a draw with the French team. So, although the Polish team is already out, we're in a really good position to become the logically equivalent champion of Euro 2024. Feeling so proud.
*Thread Reply:* It probably already happened, but can I get an invitation to future meetings?
hey folks, was just chatting with @Ibby Khajanchi, who's working on an openlineage integration at bloomberg. they're using great expectations and noticed our OL <> GX integration doesn't support the latest version, so he was asking about contributing!
*Thread Reply:* I'd be more than happy to give a review to that contribution 🙂
*Thread Reply:* Thanks Julian and Jakub. Looking forward to tackling this issue 😀 😀
*Thread Reply:* I left a comment about a decision to make regarding this PR's direction: https://github.com/OpenLineage/OpenLineage/pull/2134
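*Thread Reply:* For anyone following along, this is roughly how the existing (legacy) integration gets wired into a GX checkpoint, which is the part that needs porting to the newer GX API. This sketch assumes the pre-1.0 `action_list` mechanism, and the kwarg names are from memory of the legacy action, so treat them as assumptions:
```
# Hypothetical checkpoint action_list entry for the legacy OL <> GX integration.
# Class/module names follow the openlineage-integration-common package; the
# kwargs (openlineage_host, job_name, etc.) are illustrative assumptions.
action_list = [
    {
        "name": "openlineage",
        "action": {
            "class_name": "OpenLineageValidationAction",
            "module_name": "openlineage.common.provider.great_expectations",
            "openlineage_host": "http://localhost:5000",  # placeholder backend
            "job_name": "ge_validation",                  # placeholder job name
        },
    },
]
```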
Could someone take a look at the PRs in the docs repo? https://github.com/OpenLineage/docs/pulls
Hi, all, it's time to vote on this month's PR of the month! Here are the candidates:
> Top 5 out of 34 PRs:
> * PR #2758: Spark/transport type extraction.
> * PR #2782: Spark, Flink: Fix S3 dataset names.
> * PR #2756: remodeled transformation type.
> * PR #2767: Spark: fallback to spark.sql.warehouse.dir as table namespace.
> * PR #2643: spark: add GCP run and job facets.
Could you vote in the thread by 12 PM ET on Monday? The newsletter is scheduled to go out that afternoon. Absent votes, I'll go with the top scorer, #2758. 🙂
*Thread Reply:* My vote goes for 2758, great job there.
I think #2756 and #2758 are closely related (one is the spec change, the other is the Spark implementation), but since we have to choose one PR, I think the implementation deserves it.
Any discussion topics for today?
Hi, feedback requested on some slides I'm working on for the next Airflow town hall (on 7/10). I'm planning to transition to a quick demo after the last slide. @Maciej Obuchowski
*Thread Reply:* Nice, updated version of "that" slide 🙂
*Thread Reply:* Maybe add a brief mention that the work isn't done yet and the team is working on AIP-62, which would further increase the amount of metadata we can extract from Airflow DAGs?
*Thread Reply:* btw, @Michael Robinson isn't it today? It's today in my calendar
*Thread Reply:* Thanks! No, not today. It's one week from today
*Thread Reply:* only realized my calendar was buggy when I got onto the call with 10 other confused people 🙂
*Thread Reply:* Nice slides @Michael Robinson! I commented on that slide but otherwise this looks good
Hi all, your agenda items requested for the TSC meeting next week 🙂
Hi. Can I get opinions at the bottom of this PR: https://github.com/OpenLineage/OpenLineage/pull/2134. It'll direct how I take this PR. Thank you 😄
Hi folks, is there any interest in adding Protobuf support to the Java & Python clients?
*Thread Reply:* Sure - especially if you want to contribute and help maintain it 🙂
*Thread Reply:* I'll be happy to help and review a PR if needed 🙂
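*Thread Reply:* To make the idea concrete, here's a rough sketch of how a Protobuf transport might plug into the Python client's custom-transport extension point. Everything below is an assumption, not an existing API: there is no OpenLineage .proto schema today, so `run_event_pb2` is a hypothetical module you'd generate with protoc, the attribute and field names are illustrative, and delivery is stubbed out.
```
from openlineage.client.transport import Config, Transport


class ProtobufConfig(Config):
    """Hypothetical config; a real one would carry an endpoint, auth, etc."""


class ProtobufTransport(Transport):
    """Sketch of a transport emitting protobuf-serialized OL run events."""

    kind = "protobuf"               # attribute names assumed from the client's
    config_class = ProtobufConfig   # custom-transport pattern; verify before use

    def __init__(self, config: ProtobufConfig) -> None:
        self.config = config

    def emit(self, event) -> None:
        import run_event_pb2  # hypothetical protoc-generated module
        proto = run_event_pb2.RunEvent()
        proto.event_time = event.eventTime  # illustrative field mapping
        proto.producer = event.producer
        payload = proto.SerializeToString()  # compact wire-format bytes
        self._send(payload)

    def _send(self, payload: bytes) -> None:
        ...  # e.g. HTTP POST with Content-Type: application/x-protobuf
```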
Topics for tomorrow: • singular release of unshaded Spark interfaces: https://github.com/OpenLineage/OpenLineage/pull/2809
• also, agenda for this week's TSC? @Michael Robinson
*Thread Reply:* @Michael Robinson can we discuss the certification process proposal?
CC @Sheeri Cabral (Collibra)
*Thread Reply:* Absolutely!
*Thread Reply:* The doc to review: https://docs.google.com/document/d/1h_PI0HLX7ECVll068EmExZHF5xVYqVNGsNj7DCPiB5Y
Topic for today: Java 17 for Spark 4.0. More details in the docs -> https://docs.google.com/document/d/1tofVBMxDAKsbPV3Rh64SoBUGIPihgAHC1SVJj-WrR9g/edit?usp=sharing
*Thread Reply:* @Paweł Leszczyński @Maciej Obuchowski - do I understand the problem correctly:
> When running tests, Gradle doesn't use the JAR, but instead uses the compiled (test) classes?
*Thread Reply:* yes, this is what I meant
*Thread Reply:* Oh. This might be a use case for the `test-fixtures` plugin then
*Thread Reply:* 1. Apply the `test-fixtures` plugin
2. Move the shared test code into `src/${testFixturesSourceSetName}/java` and `src/${testFixturesSourceSetName}/resources` (out of the `test` source set)
*Thread Reply:* The `test-fixtures` plugin creates a JAR that can then be declared as a dependency in another Gradle module:
`implementation(project(":foo", configuration: "test-fixtures"))`
I don't know the exact strings, but it should be something like that.
*Thread Reply:* Though, whether this solves any problems that you're currently experiencing, I'm not so sure.
*Thread Reply:* When you apply the `test-fixtures` plugin, the dependency chain looks like this:
`test` -- depends on --> `test-fixtures`
`test` -- depends on --> `main`
`test-fixtures` -- depends on --> `main`
*Thread Reply:* This means the `test-fixtures` source set can see the classes inside the `main` source set
*Thread Reply:* Just like how the `test` source set sees them now
Hi all, here's a slide deck for tomorrow's TSC meeting: https://docs.google.com/presentation/d/1lFbIFDApGzJVX6vRZSnlCUCow-ssQtpM/edit?usp=sharing&ouid=116057523906319252244&rtpof=true&sd=true
*Thread Reply:* Have opinions about what should be highlighted in 1.17 or announced about the project? Please comment in the deck! Have a last-minute idea for a discussion item, etc.? DM me!
*Thread Reply:* (I have nothing to add, I just want to say I looked at the deck and it looks great and thanks for doing this work!)