| | Airflow | Azkaban | Conductor | Oozie | Step Functions |
|---|---|---|---|---|---|
| Owner | Apache (previously Airbnb) | LinkedIn | Netflix | Apache | Amazon |
| Community | Very Active | Somewhat active | Active | Active | N/A |
| History | 4 years | 7 years | 1.5 years | 8 years | 1.5 years |
| Main Purpose | General Purpose Batch Processing | Hadoop Job Scheduling | Microservice orchestration | Hadoop Job Scheduling | General Purpose Workflow Processing |
| Flow Definition | Python | Custom DSL | JSON | XML | JSON |
| Support for single node | Yes | Yes | Yes | Yes | N/A |
| Quick demo setup | Yes | Yes | Yes | No | N/A |
| Support for HA | Yes | Yes | Yes | Yes | Yes |
| Single Point of Failure | Yes (Single scheduler) | Yes (Single web and scheduler combined node) | No | No | No |
| HA Extra Requirement | Celery/Dask/Mesos + Load Balancer + DB | DB | Load Balancer (web nodes) + DB | Load Balancer (web nodes) + DB + Zookeeper | Native |
| Cron Job | Yes | Yes | No | Yes | Yes |
| Execution Model | Push | Push | Poll | Poll | Unknown |
| REST API Trigger | Yes | Yes | Yes | Yes | Yes |
| Parameterized Execution | Yes | Yes | Yes | Yes | Yes |
| Trigger by External Event | Yes | No | No | Yes | Yes |
| Native Waiting Task Support | Yes | No | Yes (external signal required) | No | Yes |
| Backfilling support | Yes | No | No | Yes | No |
| Native Web Authentication | LDAP/Password | XML Password | No | No | No |
| Monitoring | Yes | Limited | Limited | Yes | Limited |
| Scalability | Depending on executor setup | Good | Very Good | Very Good | Very Good |

Disclaimer

I’m not an expert in any of those engines. I’ve used some of those (Airflow & Azkaban) and checked the code. For some others I either only read the code (Conductor) or the docs (Oozie/AWS Step Functions). As most of them are OSS projects, it’s certainly possible that I might have missed certain undocumented features, or community-contributed plugins. I’m happy to update this if you see anything wrong.

Bottom line: Use your own judgement when reading this post.

Airflow

The Good

Airflow is very feature rich compared to all the other solutions. Not only can you use plugins to support all kinds of jobs, ranging from data processing jobs such as Hive and Pig (though you can also submit them via shell command) to general flow management like triggering on the existence of a file/db entry/s3 content or waiting for expected output from a web endpoint, but it also provides a nice UI that lets you inspect your DAGs (workflow dependencies) through code/graph views and monitor the real-time execution of jobs.

Airflow is also highly customizable with a currently vigorous community. You can run all your jobs through a single node using local executor, or distribute them onto a group of worker nodes through Celery/Dask/Mesos orchestration.

The Bad

Airflow by itself is still not very mature (in fact maybe Oozie is the only “mature” engine here). The scheduler needs to periodically poll the scheduling plan and send jobs to executors. This means that it alone continuously dumps an enormous amount of logs out of the box. As it works by “ticking”, your jobs are not guaranteed to get scheduled in “real-time”, if that makes sense, and this gets worse as the number of concurrent jobs increases. Meanwhile, as you have one centralized scheduler, if it goes down or gets stuck, your running jobs won’t be affected, as that is the job of the executors, but no new jobs will get scheduled. This is especially confusing when you run it with an HA setup where you have multiple web nodes, a scheduler, a broker (typically a message queue in the Celery case) and multiple executors. When the scheduler is stuck for whatever reason, all you see in the web UI is that all tasks are running, but in fact they are not moving forward, while the executors are happily reporting that they are fine. In other words, the default monitoring is still far from bulletproof.

The web UI looks very nice at first glance. However, it is sometimes confusing to new users. What does it mean when my DAG runs are “running” but my tasks have no state? The charts are not search friendly either, and some of the features are still far from well documented (though the documentation does look nice, I mean, compared to Oozie’s, which does seem outdated).

The backfilling design is good in certain cases but very error prone in others. If you have a flow whose cron schedule was disabled and re-enabled later, it would try to play catch-up and run the missed intervals, and if your jobs are not designed to be idempotent, shit would happen for real.

Azkaban

The Good

Of all the engines, Azkaban is probably the easiest to get going out of the box. The UI is very intuitive and easy to use. Scheduling and the REST APIs work just fine.

Limited HA setup works out of the box. There’s no need for load balancer because you can only have one web node. You can configure how it selects executor nodes to push jobs to and it generally seems to scale pretty nicely. You can easily run tens of thousands of jobs as long as you have enough capacity for the executor nodes.

The Bad

It is not very feature rich out of the box as a general purpose orchestration engine, but likely that’s not what it was originally designed for. Its strength lies in native support for Hadoop/Pig/Hive, though you can also achieve those using the command line. Unlike Airflow, it cannot trigger jobs from external resources by itself, nor does it support the job waiting pattern. Although you can do busy waiting through Java code/scripts, that leads to bad resource utilization.

The documentation and configuration are generally a bit confusing compared to others. It’s likely that it wasn’t meant to be open sourced at the beginning. The design is okay-ish, but you’d better have a big data center to run the executors as, without extra monitoring, scheduling would get stalled when executors run out of resources. The code quality overall is towards the lower end compared to others, so it generally only scales well when resources are not a problem.

The setup/design is not cloud friendly. You are pretty much supposed to have stable bare metal rather than dynamically allocated virtual instances with dynamic IPs. Scheduling would go south if machines vanish.

The monitoring part is sort of acceptable through JMX (does not seem documented). But it generally doesn’t work well if your machines are heavily loaded, unfortunately, as the endpoints may get stuck.

Conductor

The Good

It’s a bit unfair to put Conductor into this competition as its real purpose is microservice orchestration, whatever that means. Its HA model involves a quorum of servers sitting behind a load balancer and putting tasks onto a message queue which the worker nodes poll from, which means it’s less likely you’ll run into stalled scheduling. With the help of parameterized execution through the API, it’s actually quite good at scheduling and scaling, provided that you set up your load balancer/service discovery layer properly.

The Bad

The UI needs a bit more love. There’s currently very limited monitoring there. Although for general purpose scheduling that’s probably good enough.

It’s pretty bare-bones out of the box. There’s not even native support for running shell scripts, though it’s pretty easy to implement a task worker in Python to do the job with the examples provided.

Oozie

The Good

Oozie provides a seemingly reliable HA model through the db setup (seemingly because I’ve not dug into it). It provides native support for Hadoop related jobs as it was more or less built for that ecosystem.

The Bad

Not a very good candidate for general purpose flow scheduling as the XML definition is quite verbose and cumbersome for defining light weight jobs.

It also requires quite a bit of peripheral setup. You need a zookeeper cluster, a db and a load balancer, and each node needs to run a web app container like Tomcat. The initial setup also takes some time, which is not friendly to first-time users who just want to pilot it.

Step Functions

The Good

Step Functions is fairly new (launched in Dec 2016). However, the future seems promising. With the HA nature of the cloud platform and Lambda functions, it almost feels like it can scale infinitely (compared to the others).

It also offers some useful features for general purpose workflow handling like waiting support and dynamic branching based on output.

It’s also fairly cheap:

  • 4,000 state transitions are free each month
  • $0.025 per 1,000 state transitions thereafter ($0.000025 per state transition)

If you don’t run tens of thousands of jobs, this might be even better than running your own cluster of things.
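To make the pricing concrete, here is a rough back-of-the-envelope sketch. It only uses the two figures quoted above (which may well have changed since), and the function name and the sample workload are made up for illustration.

```javascript
// Figures as quoted in this post; check current AWS pricing before relying on them.
const FREE_TRANSITIONS_PER_MONTH = 4000;
const PRICE_PER_TRANSITION_USD = 0.000025;

// Hypothetical helper: estimated monthly cost for a number of state transitions.
function monthlyStepFunctionsCost(transitions) {
    const billable = Math.max(0, transitions - FREE_TRANSITIONS_PER_MONTH);
    return billable * PRICE_PER_TRANSITION_USD;
}

// Assumed workload: 10,000 executions a month with 10 state transitions each.
const cost = monthlyStepFunctionsCost(10000 * 10);
console.log('$' + cost.toFixed(2)); // $2.40
```

Anything under 4,000 transitions a month comes out as zero with this estimate.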

The Bad

Can only be used by AWS users. Deal breaker if you are not one of them yet.

Lambda requires extra work for production level iteration/deployment.

There’s no UI (well there is, but it’s really just a console). So if you need any level of monitoring beyond that, you need to build it yourself using CloudWatch.


Table of Contents

  1. == and ===
  2. Dig deeper
    1. What about arrays?
    2. What about objects
  3. Implicit conversions
  4. Conclusion

== and ===

Likely you know the difference between == and ===: basically, === means strict equality where no implicit conversion is allowed whereas == is loose equality.


'a' === 'a' // true
0 == false // true

Dig deeper

OK but this is too boring since we all know that.

How about this:


String('a') === 'a'
new String('a') === 'a'

Well the answers are true and false because String() returns a primitive string while new String() returns a string object. Surely new String('a') == 'a' yields true. No surprise.
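A quick way to see the difference for yourself is to look at typeof. This is just a small sketch; the variable names are made up.

```javascript
const primitive = String('a');   // conversion call: returns a primitive string
const wrapper = new String('a'); // constructor call: returns a String object

console.log(typeof primitive);          // "string"
console.log(typeof wrapper);            // "object"
console.log(wrapper === 'a');           // false: an object is never strictly equal to a primitive
console.log(wrapper == 'a');            // true: the object is converted to a primitive first
console.log(wrapper.valueOf() === 'a'); // true
```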

What about arrays?

[] === []

Well this returns false because non-primitive values are compared by reference. Two array literals always create two distinct objects in memory, so the comparison is always false.

However surprisingly you can compare arrays like this:

[1, 2, 3] < [2, 3]      // true
[2, 1, 3] > [1, 2, 3]   // true

(Wait a sec. I think I have an idea.)

How about this:

function arrEquals(arr1, arr2) {
    return !(arr1 < arr2) && !(arr2 < arr1);
}

Well this is wrong because arrays are converted to strings (their elements joined with commas) when compared with < or >, so nested arrays are effectively flattened, like this

[[1, 2], 1] < [1, 2, 3]     // true
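Here is a small sketch of what goes on behind those comparisons. arrEquals is copied from above; the sample arrays are made up for illustration.

```javascript
// arrEquals is copied from the post; the sample arrays are made up.
function arrEquals(arr1, arr2) {
    return !(arr1 < arr2) && !(arr2 < arr1);
}

// Both operands are turned into strings before < compares them.
console.log(String([[1, 2], 1])); // "1,2,1"
console.log(String([1, 2, 3]));   // "1,2,3"

// The comparison is lexicographic on those strings, not numeric.
console.log([10] < [9]); // true, because "10" < "9"

// So arrays with different contents can look "equal".
console.log(arrEquals([1, 2], ['1,2']));        // true
console.log(arrEquals([[1, 2], 3], [1, 2, 3])); // true
```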

What about objects

What’s the result of this expression?

{} === {}

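Well, if you write it as a statement of its own, it’s neither true nor false: you get a SyntaxError, because at the start of a statement {} is parsed as a code block rather than an object literal, and a block cannot be followed by ===. (Wrapped in parentheses, ({} === {}) is an ordinary expression and yields false, since the two objects are different references.) Anyway we are drifting away from the original topic…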
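If you want to check this without crashing your script, one option is to let the Function constructor parse the source for you. This is only a sketch; parsesAsStatement is a hypothetical helper name.

```javascript
// Hypothetical helper: returns true if the source parses as a function body.
function parsesAsStatement(source) {
    try {
        new Function(source); // only parses the code, nothing is executed
        return true;
    } catch (e) {
        if (e instanceof SyntaxError) return false;
        throw e;
    }
}

console.log(parsesAsStatement('{} === {}'));   // false: {} is read as an empty block
console.log(parsesAsStatement('({} === {})')); // true: parentheses force an expression
console.log(({} === {}));                      // false: two different objects
```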

Implicit conversions

Well that’s just warm-up. Let’s see something serious.

If you read something about “best practices”, you would probably be told not to use == because of the evil conversion. However chances are you’ve used it here and there and most likely that’s also part of the “best practices”.

For example:

var foo = bar();
if (foo) {
    doSomething();
}

This works because in JavaScript only a handful of values evaluate to false: 0 (including -0), '', NaN, undefined, null and of course false (plus 0n in engines that support BigInt). The rest of the world evaluates to true, including {} and [].
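A short sketch to check the truthiness rules yourself. The two sample lists are mine and are not meant to be exhaustive.

```javascript
// Sample values only, not an exhaustive list.
const falsy = [0, -0, '', NaN, undefined, null, false];
const truthy = [{}, [], '0', 'false', -1, Infinity];

console.log(falsy.every(function (v) { return !v; }));          // true
console.log(truthy.every(function (v) { return Boolean(v); })); // true

// Truthiness and == are different conversions:
console.log(Boolean([])); // true: an empty array is truthy
console.log([] == false); // true: [] becomes '' and then 0, false becomes 0
```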

Hmm here’s something wacky:


var a = {
    valueOf: function () {
        return -1;
    }
};

if (!(1 + a)) {
    alert('boom');
}

Your code does go boom because 1 + a gets implicitly converted to 1 + a.valueOf() and hence yields 0.

The actual behavior is documented in ECMA standard - http://www.ecma-international.org/ecma-262/6.0/#sec-abstract-equality-comparison

In most cases, implicit conversion causes valueOf() to be called, falling back to toString() if valueOf() is not defined or does not return a primitive.

For example:


var foo = {
    valueOf: function () {
        return 'value';
    },
    toString: function () {
        return 'toString';
    }
};

'foo' + foo // foovalue

This is because, according to the standard, when ToPrimitive is invoked for implicit conversion with no hint provided (e.g. in the case of concatenation, or when == is used between different types), it prefers valueOf by default. There are a few exceptions though, including but not limited to Array.prototype.join and alert. They convert their arguments to strings, which calls ToPrimitive with string as the hint, so toString() is favored.
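The difference between the two hints can be seen side by side in a sketch like the one below. The foo object mirrors the one above; noValueOf is a made-up name for an object that only defines toString().

```javascript
const foo = {
    valueOf: function () { return 'value'; },
    toString: function () { return 'toString'; }
};

// No hint (default): valueOf() wins.
console.log('foo' + foo);    // "foovalue"
console.log(foo == 'value'); // true

// String hint: toString() wins.
console.log(String(foo));    // "toString"
console.log(`${foo}`);       // "toString"
console.log([foo].join('')); // "toString"

// Without a usable valueOf(), the default case falls back to toString().
const noValueOf = { toString: function () { return 'fallback'; } };
console.log('' + noValueOf); // "fallback"
```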

Conclusion

In general, you probably want to avoid using == and use === most of the time if not always to avoid worrying about wonky implicit conversion magic.

However, you can’t be wary enough. For example:

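isNaN('a') === true

You might think that 'a' is a string rather than NaN and hence this should be false, but unfortunately isNaN always calls ToNumber on its argument internally (spec), and 'a' converts to NaN, hence this is true. By the same logic isNaN('1') is false, because '1' converts to the number 1.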
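To see the conversion at work, compare the global isNaN with Number.isNaN, which does not convert its argument. A small sketch:

```javascript
// Global isNaN converts its argument to a number first.
console.log(isNaN('abc'));     // true: 'abc' converts to NaN
console.log(isNaN('1'));       // false: '1' converts to 1
console.log(isNaN(''));        // false: '' converts to 0
console.log(isNaN(undefined)); // true: undefined converts to NaN

// Number.isNaN only returns true for the NaN value itself.
console.log(Number.isNaN('abc')); // false
console.log(Number.isNaN(NaN));   // true
```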



Table of Contents

  1. Have you seen eval() written like this?
  2. Regular eval
  3. Global eval
  4. Back to the original topic

Recently I’ve been writing quite a bit of front-end stuff and have seen quite a few tricks in other people’s libraries. It turns out JavaScript is a pretty wonky and interesting language, which tempts me to write a series about it, and this is the first one. This is by no means supposed to show how to write JS, just to show some “wacky” stuff.

Have you seen eval() written like this?

(0, eval)('something');

Regular eval

Eval basically allows you to execute any script within the given context.

For example:

eval('console.log("123");'); // prints out 123

(function A() {
    this.a = 1;

    eval('console.log(this.a);'); // 1
})();

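So far everything is normal: eval runs inside the current scope, so this inside the evaluated code is the same this as in the enclosing function. Note that A is invoked here as a plain function rather than with new, so in non-strict code this is the global object, not an instance of A.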

Global eval

Things get interesting when you do this:

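var someVar = 'outer'; // a global variable when run as a top-level browser script

(function A() {
    var someVar = 'inner'; // local variable that shadows the global one

    eval('console.log(someVar);'); // you may want 'outer' but this says 'inner'
})();

Well, in this scenario the direct eval call resolves someVar in the local scope of A, where the local variable shadows the global one, so it cannot get the value of someVar in the global scope.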

However, ES5 says that if you change the eval() call to an indirect one, in other words if you call it through any reference other than the plain identifier eval (for example after assigning it to another variable), then it will evaluate the input in the global scope.

So this would work:

var someVar = 'outer'; // a global variable when run as a top-level browser script

(function A() {
    var geval = eval;
    var someVar = 'inner';

    geval('console.log(someVar);'); // 'outer'
})();

Although geval and eval refer to the exact same function, the call through geval does not use the name eval directly, and thus it becomes an indirect call according to ES5.

Back to the original topic

So what the hell is (0, eval) then? Well a comma separated expression list evaluates to the last value, so it essentially is a shortcut to

var geval = eval;
geval(...);

0 is only a puppet here. It could be any value.
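Both flavors can be compared in one function. This is a sketch that should run both in a browser and in Node; scopeName and probe are made-up names, and the global is set through globalThis so that it does not depend on how the script is loaded.

```javascript
// Hypothetical names, for illustration only.
globalThis.scopeName = 'global';

function probe() {
    var scopeName = 'local';
    return {
        direct: eval('scopeName'),       // direct call: sees the local variable
        indirect: (0, eval)('scopeName') // indirect call: evaluated in the global scope
    };
}

const result = probe();
console.log(result.direct);   // "local"
console.log(result.indirect); // "global"
```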



Stop bundling in the HTTP/2 world, since it does it for you.

Modularization is a great idea

Back in the old days, when there was no concept of frontend package management, we would lay out all the scripts in order in the html file and hope that they would somehow work together if the order was right. This surely doesn’t work well with huge projects, but luckily back then JavaScript wasn’t so shiny anyway - UIs weren’t so cool and logic was much simpler. However, things do evolve. People soon noticed that this approach wouldn’t scale - cooperation across multiple teams becomes super tricky, if not impossible, and it doesn’t play well with DRY either.

Then people came up with a great idea of modularizing JS code (probably back in 2003?) the same way you would do for your beloved Java/C++ code libraries. And then there came the CommonJS definition concept by Kevin Dangoor back in 2009. Many people got to know this idea thanks to Node.js, and it works quite well, especially for server side code. Now you can easily use npm and build both the frontend and backend using the same tool very quickly, thanks to the JS community. Since people have the same interface for code modularization, team cooperation becomes much easier and projects gain benefit from much better encapsulation.


There are use cases where data needs to be copied from a source to a sink without modification. In code this might look quite simple: for example in Java, you may read data from one InputStream chunk by chunk into a small buffer (typically 8KB) and feed it into the OutputStream, or even better, you could create a PipedInputStream, which is basically just a util that maintains that buffer for you. However, if low latency is crucial to your software, this might be quite expensive from the OS perspective, as I shall explain.

What happens under the hood

Well, here’s what happens when the above code is used:

  1. JVM sends read() syscall.
  2. OS context switches to kernel mode and reads data into the input socket buffer.
  3. OS kernel then copies data into user buffer, and context switches back to user mode. read() returns.
  4. JVM processes code logic and sends write() syscall.
  5. OS context switches to kernel mode and copies data from user buffer to output socket buffer.
  6. OS returns to user mode and logic in JVM continues.
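The steps above can be sketched as a plain copy loop. This is only an illustration under assumptions: it is written in JavaScript for Node.js rather than Java, it uses temporary files instead of sockets, and copyThroughBuffer is a hypothetical helper name. Each fs.readSync/fs.writeSync pair corresponds to one read()/write() round trip through the user-space buffer.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: copies src to dst through a small user-space buffer
// and returns how many read/write calls were made along the way.
function copyThroughBuffer(srcPath, dstPath, bufferSize) {
    const buffer = Buffer.alloc(bufferSize);
    const src = fs.openSync(srcPath, 'r');
    const dst = fs.openSync(dstPath, 'w');
    let calls = 0;
    try {
        while (true) {
            // Kernel copies data into our buffer (steps 1 to 3).
            const bytesRead = fs.readSync(src, buffer, 0, bufferSize, null);
            calls++;
            if (bytesRead === 0) break; // end of file
            // Our buffer is copied back into the kernel (steps 4 to 6).
            let written = 0;
            while (written < bytesRead) {
                written += fs.writeSync(dst, buffer, written, bytesRead - written);
                calls++;
            }
        }
    } finally {
        fs.closeSync(src);
        fs.closeSync(dst);
    }
    return calls;
}

// Demo with 20 KB of sample data and an 8 KB buffer.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'copy-demo-'));
const srcPath = path.join(dir, 'in.bin');
const dstPath = path.join(dir, 'out.bin');
fs.writeFileSync(srcPath, Buffer.alloc(20 * 1024, 7));
const callCount = copyThroughBuffer(srcPath, dstPath, 8 * 1024);
console.log('read/write calls:', callCount);
```

Even for this tiny payload the data crosses the user/kernel boundary several times, and every crossing is a copy plus a pair of context switches.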

You should read this if

  • You want to set up a personal blog
  • You know what Markdown is
  • You don’t want to set up a heavy Wordpress environment
  • You don’t want to set up any database just for the blog
  • You either don’t have a VPS or want to host blog content in some easy-to-reach place.
  • You still want a template/theme system.

Solution

Github Pages + Hexo (what this site uses)


Shawn Xu

Full-stack Software Engineer in Bay Area