Third-Party Projects for Apache Spark

This page tracks external software projects that supplement Apache Spark and add to its ecosystem.

To add a project, open a pull request against the spark-website repository. Add an entry to this markdown file, then run jekyll build to generate the HTML too. Include both in your pull request. See the README in this repo for more information.

Note that all project and product names should follow trademark guidelines.

spark-packages.org

spark-packages.org is an external, community-managed list of third-party libraries, add-ons, and applications that work with Apache Spark. You can add a package as long as you have a GitHub repository.

Infrastructure Projects


  • REST Job Server for Apache Spark - REST interface for managing and submitting Spark jobs on the same cluster.
  • MLbase - Machine Learning research project on top of Spark
  • Apache Mesos - Cluster management system that supports running Spark
  • Alluxio (née Tachyon) - Memory speed virtual distributed storage system that supports running Spark
  • FiloDB - a Spark integrated analytical/columnar database, with in-memory option capable of sub-second concurrent queries
  • Zeppelin - Multi-purpose notebook which supports 20+ language backends, including Apache Spark
  • EclairJS - enables Node.js developers to code against Spark, and data scientists to use JavaScript in Jupyter notebooks.
  • Mist - Serverless proxy for Spark cluster (spark middleware)
  • K8S Operator for Apache Spark - Kubernetes operator for specifying and managing the lifecycle of Apache Spark applications on Kubernetes.
  • IBM Spectrum Conductor - Cluster management software that integrates with Spark and modern computing frameworks.
  • Delta Lake - Storage layer that provides ACID transactions and scalable metadata handling for Apache Spark workloads.
  • MLflow - Open source platform to manage the machine learning lifecycle, including deploying models from diverse machine learning libraries on Apache Spark.
  • Koalas - Data frame API on Apache Spark that more closely follows Python’s pandas.
  • Apache DataFu - A collection of utils and user-defined-functions for working with large scale data in Apache Spark, as well as making Scala-Python interoperability easier.

Applications Using Spark

  • Apache Mahout - Previously on Hadoop MapReduce, Mahout has switched to using Spark as the backend
  • Apache MRQL - A query processing and optimization system for large-scale, distributed data analysis, built on top of Apache Hadoop, Hama, and Spark
  • BlinkDB - a massively parallel, approximate query engine built on top of Shark and Spark
  • Spindle - Spark/Parquet-based web analytics query engine
  • Thunderain - a framework for combining stream processing with historical data, think Lambda architecture
  • DF from Ayasdi - a Pandas-like data frame implementation for Spark
  • Oryx - Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning
  • ADAM - A framework and CLI for loading, transforming, and analyzing genomic data using Apache Spark
  • TransmogrifAI - AutoML library for building modular, reusable, strongly typed machine learning workflows on Spark with minimal hand tuning
  • Natural Language Processing for Apache Spark - A library to provide simple, performant, and accurate NLP annotations for machine learning pipelines
  • Rumble for Apache Spark - A JSONiq engine to query, with a functional language, large, nested, and heterogeneous JSON datasets that do not fit in dataframes.

Performance, Monitoring, and Debugging Tools for Spark

  • Performance and debugging library - A library to analyze Spark and PySpark applications for improving performance and finding the cause of failures
  • Data Mechanics Delight - Delight is a free, hosted, cross-platform Spark UI alternative backed by an open-source Spark agent. It features new metrics and visualizations to simplify Spark monitoring and performance tuning.

Additional Language Bindings

C# / .NET

  • Mobius: C# and F# language binding and extensions to Apache Spark

Clojure

  • Geni - A Clojure dataframe library that runs on Apache Spark with a focus on optimizing the REPL experience.

Groovy

Julia


Kotlin

Tutorial: Load data and run queries on an Apache Spark cluster in Azure HDInsight

In this tutorial, you learn how to create a dataframe from a csv file, and how to run interactive Spark SQL queries against an Apache Spark cluster in Azure HDInsight. In Spark, a dataframe is a distributed collection of data organized into named columns. A dataframe is conceptually equivalent to a table in a relational database or a data frame in R/Python.

In this tutorial, you learn how to:

  • Create a dataframe from a csv file
  • Run queries on the dataframe

Prerequisites

An Apache Spark cluster on HDInsight. See Create an Apache Spark cluster.

Create a Jupyter Notebook


Jupyter Notebook is an interactive notebook environment that supports various programming languages. The notebook allows you to interact with your data, combine code with markdown text and perform simple visualizations.


  1. Edit the URL https://SPARKCLUSTER.azurehdinsight.net/jupyter by replacing SPARKCLUSTER with the name of your Spark cluster. Then enter the edited URL in a web browser. If prompted, enter the cluster login credentials for the cluster.

  2. From the Jupyter web page, select New > PySpark to create a notebook.

    A new notebook is created and opened with the name Untitled (Untitled.ipynb).

    Note

    By using the PySpark kernel to create a notebook, the Spark session is automatically created for you when you run the first code cell. You do not need to explicitly create the session (see the sketch after this list).
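A minimal sketch of this behavior, assuming a cell running in the PySpark kernel: the session object is already defined as spark, so the first cell can use it directly.

```python
# In a PySpark-kernel notebook cell, `spark` is preconfigured by the kernel;
# there is no need to build a SparkSession yourself.
print(spark.version)                 # Spark version running on the cluster
print(spark.sparkContext.appName)    # name of the automatically created application
```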

Create a dataframe from a csv file


Applications can create dataframes directly from files or folders in remote storage such as Azure Storage or Azure Data Lake Storage; from a Hive table; or from other data sources supported by Spark, such as Cosmos DB, Azure SQL DB, DW, and so on. This tutorial uses the HVAC.csv file, which comes with all HDInsight Spark clusters; the data captures the temperature variations of some buildings.
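The snippet below is a sketch of those source options; the storage paths and Hive table name are placeholders for illustration, not objects that exist on every cluster.

```python
# Hypothetical examples only; replace the paths and table name with your own.
df_from_csv = spark.read.csv("wasbs:///example/data/sample.csv",
                             header=True, inferSchema=True)      # file in Azure Storage
df_from_hive = spark.table("some_hive_table")                    # existing Hive table
df_from_parquet = spark.read.parquet("wasbs:///example/data/parquet_folder/")
```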

  1. Paste the code that imports the types required for this scenario into an empty cell of the Jupyter Notebook, and then press SHIFT + ENTER to run it (a sketch of this step and step 3 appears after this list).

    When running an interactive query in Jupyter, the web browser window or tab caption shows a (Busy) status along with the notebook title. You also see a solid circle next to the PySpark text in the top-right corner. After the job is completed, it changes to a hollow circle.

  2. Note the session id returned (for example, 0). If desired, you can retrieve the session details by navigating to https://CLUSTERNAME.azurehdinsight.net/livy/sessions/ID/statements, where CLUSTERNAME is the name of your Spark cluster and ID is your session id number.

  3. Create a dataframe and a temporary table (hvac) by running code in the next empty cell (see the sketch after this list).
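The code for steps 1 and 3 is not reproduced on this page. The following is a minimal sketch; the sample-data path is an assumption about where HDInsight stores HVAC.csv, so adjust it for your cluster.

```python
from pyspark.sql.types import *   # types referenced in step 1

# Assumed location of the bundled HVAC.csv sample; adjust if your cluster stores it elsewhere.
csv_path = "wasbs:///HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv"

# Step 3: create the dataframe, letting Spark infer column types from the header row.
hvac_df = spark.read.csv(csv_path, header=True, inferSchema=True)

# Register a temporary table named hvac so it can be queried with Spark SQL.
hvac_df.createOrReplaceTempView("hvac")
hvac_df.printSchema()
```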

Run queries on the dataframe

Once the table is created, you can run an interactive query on the data.

  1. Run a Spark SQL query against the hvac table in an empty cell of the notebook (a sketch appears after this list).

    The query results are displayed as tabular output.

  2. You can also see the results in other visualizations. To see an area graph for the same output, select Area, and then set the other graph settings as needed.

  3. From the notebook menu bar, navigate to File > Save and Checkpoint.

  4. If you're starting the next tutorial now, leave the notebook open. If not, shut down the notebook to release the cluster resources: from the notebook menu bar, navigate to File > Close and Halt.
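For step 1, here is a sketch of an interactive query against the temporary hvac table. The column names (BuildingID, TargetTemp, ActualTemp, Date) and the date filter are assumptions based on the HVAC sample data; match them to your file's header row.

```python
# Query the temporary table registered earlier; adjust column names to your data.
result = spark.sql("""
    SELECT BuildingID,
           TargetTemp - ActualTemp AS temp_diff,
           Date
    FROM hvac
    WHERE Date = '6/1/13'
""")
result.show(10)   # display the first rows as tabular output
```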

Clean up resources

With HDInsight, your data and Jupyter Notebooks are stored in Azure Storage or Azure Data Lake Storage, so you can safely delete a cluster when it isn't in use. You're also charged for an HDInsight cluster, even when it's not in use. Since the charges for the cluster are many times more than the charges for storage, it makes economic sense to delete clusters when they aren't in use. If you plan to work on the next tutorial immediately, you might want to keep the cluster.


Open the cluster in the Azure portal, and select Delete.


You can also select the resource group name to open the resource group page, and then select Delete resource group. By deleting the resource group, you delete both the HDInsight Spark cluster, and the default storage account.

Next steps


In this tutorial, you learned how to create a dataframe from a csv file, and how to run interactive Spark SQL queries against an Apache Spark cluster in Azure HDInsight. Advance to the next article to see how the data you registered in Apache Spark can be pulled into a BI analytics tool such as Power BI.