#dask is strange. Sometimes using the dask counterpart to numpy functions or arrays makes computations slower. Sometimes not. Also, lots of variability in runtime. #python #dataengineering
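A minimal sketch of the kind of comparison behind that observation: for modest array sizes, dask.array's task-graph overhead can outweigh any benefit from parallelism, and timings swing around run to run. The array size and chunking below are assumptions for illustration, not a benchmark recipe.

```python
import time

import numpy as np
import dask.array as da

x_np = np.random.random((5_000, 5_000))
x_da = da.from_array(x_np, chunks=(1_000, 1_000))  # dask counterpart of the same data

t0 = time.perf_counter()
np_result = (x_np ** 2).sum()            # eager numpy
t1 = time.perf_counter()
da_result = (x_da ** 2).sum().compute()  # lazy graph, executed here
t2 = time.perf_counter()

print(f"numpy: {t1 - t0:.3f}s  dask: {t2 - t1:.3f}s")  # expect run-to-run variability
```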
Thank you to the #Kone and Maj and Tor Nessling Foundations for supporting this work. A quantitative work like this would not be possible without a robust suite of FOSS tools. My thanks to the maintainers of #QGIS, #pandas, #geopandas, #duckdb, #dask, #statsmodels, #jupyter and many more!
Working on solutions for large-scale #ScientificComputing?
#EuroSciPy2025 wants your original research on parallel and distributed computing with #Python!
Submit your breakthrough approaches to scaling scientific workloads as tutorials, talks, or posters:
I've improved my StackOverflow question and added a bounty. I'm once again asking the amazing #python , #dask , and #Django community if you could offer some of your knowledge to me and the world. I suppose this might just be a #Dask question, but I am boosting it to reach out to anyone who might lend a hand
https://stackoverflow.com/q/79198230
Would any of the wonderful #python , #dask , or #Django people have a few minutes to spare helping me with a performance question? Our community is so wonderful and I'm so grateful for you all
https://stackoverflow.com/questions/79198230/django-dask-integration-how-to-do-more-with-less
I am moving all my computing libraries to #xarray, no regrets. It is a natural way to manipulate datasets of rectangular arrays, with named coordinates and dimensions: https://xarray.dev/
There are several possible backends, including #dask which allows lazy data loading.
I had the pleasure of meeting some of the devs last week, who showed me a preview of the upcoming `DataTree` structure which is going to make this library even more versatile!
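For anyone curious what the lazy-loading combination looks like in practice, here is a minimal sketch; the file name, variable, and chunk size are made up for illustration.

```python
import xarray as xr

# Passing `chunks` makes xarray back the variables with dask arrays (lazy loading).
ds = xr.open_dataset("model_output.nc", chunks={"time": 100})

monthly_mean = ds["temperature"].groupby("time.month").mean()  # still lazy
result = monthly_mean.compute()  # data is only read and reduced here
```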
Me: Groks #dask, teaches herself @matplotlib, pretty fluent in @pandas_dev and #python.
Also me: signs an index incorrectly, spends 2 hours debugging a list index out of range error before spotting it
Now imagine how this #scales with tools like #Copilot and #GenerativeAI #coding tools ...
As part of my #PhD work, I recently had to perform computation on two very large files using @pandas_dev and I turned to #dask - a set of libraries on top of #pandas, aimed at scaling #python workloads from the laptop to the cluster.
Here's what I learned!
https://blog.kathyreid.id.au/2024/01/27/scaling-python-dask/
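As a rough illustration of the pattern the post describes (not the code from the blog), the dask.dataframe API mirrors pandas closely, so a laptop-scale script carries over with few changes; the file pattern and column names here are assumptions.

```python
import dask.dataframe as dd

# Each matching file becomes one or more partitions, read lazily.
df = dd.read_csv("measurements_*.csv")

# Familiar pandas-style operations build a task graph instead of executing eagerly.
summary = df.groupby("category")["value"].mean()

print(summary.compute())  # the graph runs across local cores or a cluster
```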
#python #geodata It's so convenient these days to have libraries like #xarray and #rioxarray that can open huge image mosaic files with 45k x 29k pixels in a virtual fashion automagically, using #dask under the hood. Just look up a few hundred pixels with xarray indexing and add a `.compute()` at the end to get the result. So cool. Thanks to all those devs for making it work so nicely!
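A minimal sketch of that workflow, assuming a large GeoTIFF mosaic; the file name and pixel window are invented for illustration.

```python
import rioxarray

# chunks=True asks rioxarray to back the raster with dask arrays (lazy, windowed reads).
mosaic = rioxarray.open_rasterio("huge_mosaic.tif", chunks=True)

window = mosaic.isel(x=slice(20_000, 20_200), y=slice(10_000, 10_200))
pixels = window.compute()  # only the chunks covering this window are actually read
```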
Good morning folks! It's been a while since I did one of my #TwitterMigration #Introduction #ConnectionList posts where I curate interesting people for you to follow on the #Fediverse
Today, I'd like you to meet:
@LMonteroSantos Lola is a #PhD #researcher at #EUI interested in #data #regulation, digital #economy and #AntiTrust, passionate about #DataScience and #programming. New to Mastodon, please make her welcome
@danlockton is a #Professor at @TUEindhoven where he works in #design, #imagination and #climate #futures. He often posts interesting things around co-design and #collaboration
@1sabelR is a #researcher @ANUResearch where she is into #SolarPunk and @scicomm She co-hosts the #SciBurst #podcast - worth a listen!
@timrichards is a #travel #writer based in #Naarm / #Melbourne in Australia, specialising in #rail
@microstevens is a #DataScience facilitator at #UWMadison and she works in #OpenScience and #genomics
@mrocklin does amazing things with #dask in #python, and I am very grateful in recent weeks for his posts and #StackOverflow responses. Thank you
@everythingopen is Australia's premier open #technology conference, covering #linux, #OpenSource, #OpenData, #OpenGov, #OpenGLAM, #OpenScience and everything else open. You should check it out!
That's all for today - don't forget to share your own lists so we can more richly connect and curate the conversations we want to have
My #dask coding worked and I got my data! I have been trying to get this data for three weeks
Today's job is to manually validate it.
Random #research idea while babysitting a #dask process:
- I wonder if there's a way to save a bunch of terms of service documents to be able to version them, and show how they have changed over time, particularly in respect to arbitration, copyright and other #datafication processes?
Why yes I partitioned a 1Gb file into 2000 partitions with #dask, why do you ask?
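For the record, a sketch of how one might dial that back: 2,000 partitions for roughly 1 GB means tiny partitions and a lot of scheduler overhead. The file name and target partition size are assumptions.

```python
import dask.dataframe as dd

df = dd.read_csv("data.csv")
print(df.npartitions)  # e.g. 2000 tiny partitions

# Aim for fewer, larger partitions (roughly 100 MB each is a common rule of thumb).
df = df.repartition(partition_size="100MB")
```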
I'm taking my first foray into #dask - have done the tutorial and read what I can in Stack Overflow.
It's definitely a steep learning curve, but it's been very interesting so far.
@holden's excellent book has been very useful so far, and I think the more I work with it, the more I will master the nuances - how to set up the Client and scheduler with the optimum number of workers and threads, the optimum partitioning, etc.
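Not from the book, just a minimal sketch of the knobs mentioned above: a local cluster where the worker count, threads per worker, and memory limit are explicit. The numbers are assumptions to be tuned per machine and workload.

```python
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=4, threads_per_worker=2, memory_limit="4GB")
client = Client(cluster)

print(client.dashboard_link)  # diagnostics dashboard for watching tasks and memory
```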
So I'm almost finished with my first independent implementation of a standard, and I want to write up the process bc it was surprisingly challenging and I learned a lot about how to write them.
I was purposefully experimenting with different methods of translation (e.g. adapter classes vs. pure functions in a build pipeline, recursive functions vs. flattening everything), so the code isn't as sleek as it could be. I had planned on this beforehand, but two major things I learned were a) not just isolating special cases, but making specific means to organize them and make them visible, and b) isolating different layers of the standard (e.g. schema language is separate from models is separate from I/O) and not backpropagating special cases between layers.
This is also my first project that's fully in the "new style" of python that's basically a typed language with validating classes, and it makes you write differently but uniformly for the better - it's almost self-testing bc if all the classes validate in an end-to-end test then you know that shit is working as intended. Forcing yourself to deal with errors immediately is the way.
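A minimal sketch of that "typed, validating classes" style using pydantic (v2); the model and fields are invented for illustration, not taken from the project.

```python
from pydantic import BaseModel, field_validator


class Recording(BaseModel):
    name: str
    sampling_rate_hz: float
    channel_ids: list[int]

    @field_validator("sampling_rate_hz")
    @classmethod
    def must_be_positive(cls, value: float) -> float:
        if value <= 0:
            raise ValueError("sampling_rate_hz must be positive")
        return value


# Construction fails loudly on bad data, so an end-to-end test that builds every
# model doubles as a validation pass over the whole pipeline.
rec = Recording(name="session-01", sampling_rate_hz=30_000.0, channel_ids=[0, 1, 2])
```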
Lots more 2 say but anyway we're like 2 days of work away from a fully independent translation of #NWB to #LinkML that uses @pydantic models + #Dask for arrays. Schema extensions are now no-code: just write the schema (in nwb schema lang or linkml) and poof you can use it. Hoping this makes it way easier for tools to integrate with NWB, and my next step will be to put them in a SQL database and triple store so we can, y'know, more easily share and grab smaller pieces of them and index across lots of datasets.
Then, uh, we'll bridge our data archives + notebooks with the fedi for a new kind of scholarly communication....
Learn how to make your #Python code perform faster during our interactive Parallel #Programming #Workshop, and solve practical problems using #Dask, #Numba, and #Snakemake.
Register now!
https://www.esciencecenter.nl/event/parallel-programming-in-python-3/
super short #dask question: Do the scheduler and client need to run with the same python version (or even conda env) as the code I want to use for production? Or is it independent?
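Not an answer, but one thing that helps when chasing this: distributed's Client can report the Python and package versions seen by the client, scheduler, and workers, which makes mismatches visible. The scheduler address below is an assumption.

```python
from dask.distributed import Client

client = Client("tcp://scheduler-host:8786")

# check=True raises if required package versions differ between client, scheduler, and workers.
versions = client.get_versions(check=True)
print(versions["client"])
print(versions["scheduler"])
```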