Using ClickHouse ReplicatedMergeTree

Posted on 2019-03-25 | Categories: olap, BigData, clickhouse, big data

Using the Nested data type

To be written.

Loading CSV data over HTTP

cat da.csv | curl 'http://10.185.217.47:8123/?user=user&password=password&query=INSERT%20INTO%20table%20FORMAT%20CSV' --data-binary @-
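The same load can be scripted from Python. The following is a minimal sketch using only the standard library. The helper names build_insert_url and post_csv are illustrative, and the host, table, and credentials are the placeholders from the command above.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request, urlopen


def build_insert_url(host, table, user, password, port=8123):
    """Build the HTTP-interface URL for an INSERT ... FORMAT CSV statement."""
    params = {
        "user": user,
        "password": password,
        "query": "INSERT INTO {} FORMAT CSV".format(table),
    }
    # quote_via=quote encodes spaces as %20, matching the curl command
    return "http://{}:{}/?{}".format(host, port, urlencode(params, quote_via=quote))


def post_csv(url, csv_path):
    """POST the raw CSV bytes as the request body and return the HTTP status."""
    with open(csv_path, "rb") as f:
        request = Request(url, data=f.read(), method="POST")
    with urlopen(request) as response:
        return response.status


# Usage (needs a reachable server, so it is left commented out):
# post_csv(build_insert_url("10.185.217.47", "table", "user", "password"), "da.csv")
```

Reading the whole file into memory keeps the sketch short. For large files a streaming upload would be a better fit.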

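Database and table operations

Add ON CLUSTER cluster1 to statements that create or drop databases and tables (and to similar DDL). The statement then only needs to be run on one node, and the change is propagated to the whole cluster.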
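As an illustration, the statements can be generated by small helpers so that the ON CLUSTER clause is never forgotten. This is only a sketch: the function names are made up for this example, the database name insight and table name demo are placeholders, and cluster1 is the cluster name used above.

```python
def create_database(db, cluster="cluster1"):
    """DDL that creates a database on every node of the cluster."""
    return "CREATE DATABASE IF NOT EXISTS {} ON CLUSTER {}".format(db, cluster)


def drop_table(db, table, cluster="cluster1"):
    """DDL that drops a table on every node of the cluster."""
    return "DROP TABLE IF EXISTS {}.{} ON CLUSTER {}".format(db, table, cluster)


# Placeholder names, for illustration only
print(create_database("insight"))
print(drop_table("insight", "demo"))
```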

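Dropping a nested column from a DataFrame in pyspark

Posted on 2019-03-25 | Categories: olap, BigData, clickhouse, big data

A Hive table has a column of struct type. The requirement is to extract one sub-column from that struct, convert it to a string, and add it as a new column at the same level as the struct column.

Searching online showed that manipulating sub-columns is convenient in Scala, but our team works in Python. I then found this approach, "drop nested columns": https://stackoverflow.com/questions/45061190/dropping-nested-column-of-dataframe-with-pyspark/48906217#48906217 and adapted it to my requirement.

exclude_nested_field drops every unwanted field.name together with its StructField, recursing into nested StructType fields and rewrapping them, so that from the schema's point of view the sub-column has been removed.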

from pyspark.sql.types import StructField, StructType


def exclude_nested_field(schema, unwanted_fields, parent=""):
    # Rebuild the schema, skipping every field whose dotted path is in unwanted_fields
    new_schema = []
    for field in schema:
        # Dotted path of this field, e.g. "parent.child"
        full_field_name = field.name
        if parent:
            full_field_name = parent + "." + full_field_name
        if full_field_name not in unwanted_fields:
            if isinstance(field.dataType, StructType):
                # Recurse into nested structs so unwanted sub-fields are dropped too
                inner_schema = exclude_nested_field(field.dataType, unwanted_fields, full_field_name)
                new_schema.append(StructField(field.name, inner_schema))
            else:
                new_schema.append(StructField(field.name, field.dataType))
    return StructType(new_schema)
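The other half of the requirement, promoting a sub-column to a top-level string column, can be expressed as a Spark SQL expression string. The sketch below only builds that string. The helper name promote_nested_field and the column names info and user_id are hypothetical, and the default alias (struct name, underscore, field name) is an assumption of this example.

```python
def promote_nested_field(struct_col, sub_field, new_col=None):
    """Return a Spark SQL expression that casts struct_col.sub_field to STRING
    and exposes it under a top-level column name."""
    alias = new_col or "{}_{}".format(struct_col, sub_field)
    return "CAST(`{}`.`{}` AS STRING) AS `{}`".format(struct_col, sub_field, alias)


expr = promote_nested_field("info", "user_id")
print(expr)

# With a DataFrame df that has a struct column named info (hypothetical):
#   df = df.selectExpr("*", expr)
```

Building the expression as a string keeps the helper independent of a running Spark session, so it can be checked without a cluster.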

Read more »

Installing Superset to visualize ClickHouse

Posted on 2019-03-25 | Categories: clickhouse, visualization

Installing from the official documentation

The environment is CentOS 7 with Python 3.7. For setting up Python 3.7, see: https://segmentfault.com/a/1190000015628625#articleHeader1

Tutorial reference:

Problem:

Could not install packages due to an EnvironmentError: [SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC] de

Use the Douban pip mirror:

pip install superset -i https://pypi.douban.com/simple/ --trusted-host pypi.douban.com

Read more »

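Externalizing the application.properties and logback.xml configuration files

Posted on 2019-02-21 | Categories: spring boot, java

Custom packaging output

Use maven-assembly-plugin to customize the packaging output. Its configuration in the pom file is as follows: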

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-assembly-plugin</artifactId>
    <configuration>
        <finalName>4a-insight-clickhouse</finalName>
        <descriptors>
            <descriptor>src/main/assembly/package.xml</descriptor>
        </descriptors>
    </configuration>
    <executions>
        <execution>
            <!-- Bind to the package lifecycle phase -->
            <phase>package</phase>
            <goals>
                <!-- Run only once -->
                <goal>single</goal>
            </goals>
        </execution>
    </executions>
</plugin>

<!-- <resources> is a child of <build> (a sibling of <plugins>), not of <plugin> -->
<resources>
    <resource>
        <directory>src/main/resources</directory>
        <excludes>
            <exclude>logback-spring.xml</exclude>
        </excludes>
    </resource>
</resources>

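<descriptors> gives the location of the custom assembly descriptor file, and <phase> sets the Maven lifecycle phase in which the plugin runs. Because logback-spring.xml is now supplied externally, the copy inside the jar must be excluded, otherwise an error occurs.

Read more »

Fixing the error when switching the pyspark deploy mode from client to cluster

Posted on 2018-12-27 | Categories: big data, BigData, Spark

Problem

I wrote a pyspark program that imports some custom .py files and receives 8 arguments from a shell script, as follows: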

#!/usr/bin/env bash
spark-submit \
     --master yarn \
     --deploy-mode cluster \
     --conf spark.shuffle.service.enabled=true \
     --queue xxx \
     --conf spark.dynamicAllocation.enabled=true \
     --conf spark.default.parallelism=1000 \
     --conf spark.sql.shuffle.partitions=1000 \
     --conf spark.sql.broadcastTimeout=7200 \
     --executor-memory 18g \
     --executor-cores 3 \
     --conf spark.blacklist.enabled=true dependencies/test.py "$1" "$2" "$3" "$4" "$5" "$6" "$7" "$8"
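On the Python side, the entry script has to pick up those eight positional arguments. The post does not show that part of dependencies/test.py, so the following is only a sketch of how it might look. The name parse_args and the strict count check are assumptions of this example.

```python
import sys

EXPECTED_ARGS = 8  # the shell script forwards exactly eight positional arguments


def parse_args(argv):
    """Return the positional arguments, failing fast on a wrong count."""
    args = argv[1:]  # argv[0] is the script name
    if len(args) != EXPECTED_ARGS:
        raise ValueError(
            "expected {} arguments, got {}".format(EXPECTED_ARGS, len(args)))
    return args


# In the entry script this would typically be called as:
#   params = parse_args(sys.argv)
```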

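However, after switching from --deploy-mode client to --deploy-mode cluster, the console reported the following error:

Read more »

Importing data from Hive into ClickHouse with Spark

Posted on 2018-12-20 | Categories: olap, BigData, clickhouse, big data, spark

This post describes how to use Spark to import data from Hive into ClickHouse over JDBC. Reference: https://github.com/yandex/clickhouse-jdbc/issues/138

The Spark code, hive2mysql_profile.py, is: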

# -*- coding: utf-8 -*-
import datetime
from pyspark.sql import SparkSession
import sys


def sync_profiles(spark, url, driver, yesterday):
    # Read a single day's partition (dt = yesterday) from the Hive table
    userprofile_b_sql = "select * from app.table_test where dt = '{date}'".format(
        date=yesterday)
    result = spark.sql(userprofile_b_sql)
    # Connection and write options handed to the JDBC data source
    properties = {'driver': driver,
                  'socket_timeout': '300000',
                  'rewriteBatchedStatements': 'true',
                  'batchsize': '1000000',
                  'numPartitions': '1',
                  'user': 'root',
                  'password': '123456'}

    # Append the rows to the target ClickHouse table
    result.write.jdbc(url=url, table='dmp9n_user_profile_data_bc', mode='append', properties=properties)


if __name__ == '__main__':
    yesterday = (datetime.datetime.now() + datetime.timedelta(days=-1)).strftime("%Y-%m-%d")
    if len(sys.argv) == 2:
        yesterday = sys.argv[1]

    spark = SparkSession.builder \
        .appName("hive2clickhouse") \
        .enableHiveSupport() \
        .getOrCreate()

    url = "jdbc:clickhouse://11.40.243.166:8123/insight"
    driver = 'ru.yandex.clickhouse.ClickHouseDriver'
    sync_profiles(spark, url, driver, yesterday)
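Because the date ends up interpolated into the SQL text, it can be worth validating a date passed on the command line before using it. The following is a small sketch of that idea, not part of the original script. The name resolve_partition_date and the now parameter are introduced here for illustration.

```python
import datetime


def resolve_partition_date(argv, now=None):
    """Return the partition date as YYYY-MM-DD.

    Uses argv[1] when exactly one argument is given, otherwise yesterday.
    """
    if len(argv) == 2:
        # strptime raises ValueError if the argument does not parse as a date
        parsed = datetime.datetime.strptime(argv[1], "%Y-%m-%d")
        return parsed.strftime("%Y-%m-%d")
    now = now or datetime.datetime.now()
    return (now - datetime.timedelta(days=1)).strftime("%Y-%m-%d")


# In the script this could replace the manual sys.argv handling:
#   yesterday = resolve_partition_date(sys.argv)
```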

Read more »

ClickHouse usage summary

Posted on 2018-12-20 | Categories: olap, BigData, clickhouse, big data

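The following summarizes my brief evaluation of ClickHouse. I have not yet run into many of its pitfalls, so this is only a basic introduction.

Configuration files

ClickHouse's configuration files are config.xml and users.xml under /etc/clickhouse-server/, plus the cluster configuration file metrika.xml under /etc/. metrika.xml exposes the usernames and passwords used to log in to each node in plain text. The configuration files are not described further here.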
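To make the point about plain-text credentials concrete, here is a minimal sketch that lists the replica entries found in a metrika.xml-style document. The sample XML is made up for this example, and its layout (replica elements with host, port, user, and password children) is an assumption about a typical cluster definition, not a copy of any real file.

```python
import xml.etree.ElementTree as ET

# Made-up sample following an assumed metrika.xml layout
SAMPLE = """
<yandex>
  <clickhouse_remote_servers>
    <cluster1>
      <shard>
        <replica>
          <host>node1</host>
          <port>9000</port>
          <user>default</user>
          <password>secret</password>
        </replica>
      </shard>
    </cluster1>
  </clickhouse_remote_servers>
</yandex>
"""


def list_replicas(xml_text):
    """Return one dict per <replica> element, keyed by child tag name."""
    root = ET.fromstring(xml_text)
    replicas = []
    for replica in root.iter("replica"):
        replicas.append({child.tag: (child.text or "").strip() for child in replica})
    return replicas


for entry in list_replicas(SAMPLE):
    print(entry)
```

Anyone who can read the file can read these values, so its file permissions deserve attention.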

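Clients

ClickHouse offers several kinds of clients, including a command-line client, a JDBC driver, and an HTTP client. This post mainly covers how to wrap the HTTP client in Spring Boot and how to integrate MyBatis on top of the JDBC driver.

JDBC

The JDBC driver is the official one. It connects to port 8123 and speaks HTTP. The official driver is at: https://github.com/yandex/clickhouse-jdbc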

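Read more »

Plotting with matplotlib in Python

Posted on 2018-12-06 | Categories: python, tools, plotting

Preface

Before graduating I preferred MATLAB for plotting data, but at work I do not have MATLAB installed locally, so I decided to plot with Python instead. A quick search showed that matplotlib is a popular choice for plotting in Python, and since I happened to need some plots, I decided to give it a try.

Requirement: there is one year of data for 2017 at 1-hour intervals, so roughly 24 * 365 = 8760 points. Because the figure cannot be too large, labelling every value on the x-axis would produce a dense, overlapping mess, so the x-axis can only be labelled by month, while every data point must still be plotted.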
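One way to meet this requirement is to plot every hourly point and let matplotlib place x-axis ticks only at month boundaries. The sketch below is one possible approach under that assumption. The values are synthetic placeholders rather than the real data, and the function name make_figure is made up for this example.

```python
import datetime

import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.dates as mdates
import matplotlib.pyplot as plt


def make_figure(times, values):
    """Plot every point, but put x-axis ticks only at the start of each month."""
    fig, ax = plt.subplots(figsize=(12, 4))
    ax.plot(times, values, linewidth=0.5)
    ax.xaxis.set_major_locator(mdates.MonthLocator())
    ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y-%m"))
    fig.autofmt_xdate()  # rotate the tick labels so they do not overlap
    return fig, ax


# One timestamp per hour for 2017, with placeholder values
start = datetime.datetime(2017, 1, 1)
times = [start + datetime.timedelta(hours=h) for h in range(24 * 365)]
values = [h % 24 for h in range(24 * 365)]

fig, ax = make_figure(times, values)
fig.savefig("hourly_2017.png")
```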

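Read more »

Spark SQL tuning

Posted on 2018-10-12 | Categories: big data, BigData, Spark, Spark-Sql, tuning

Working in e-commerce, I get to handle data sets of tens or even hundreds of billions of records, so resource tuning and data skew problems are unavoidable in day-to-day work. With help from teammates and various online tutorials, I finally resolved a series of problems and met the bar for going to production. Thanks to my team lead for giving me so much time to learn and investigate. I am writing the process down so that I do not have to repeat the work when similar problems come up again.

Read more »

Fixing the problem of too many small files in Hive tables

Posted on 2018-09-06 | Categories: hive, big data, BigData, Spark

Background

A while ago, our operations colleagues reported that there were too many small files and asked us to deal with it, so we looked into how to merge the existing small files. Hadoop keeps the file-index metadata in the NameNode, so too many small files means more NameNode memory is consumed.

Process

Some research online showed that storing a Hive table in ORC format lets you merge a partition's small files with the concatenate command, and can also save more than 80% of the storage space. Great news!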
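For a table with many partitions, the merge statements can be generated rather than typed by hand. This is a minimal sketch under assumed names: the table app.table_test and the partition key dt are placeholders, and concatenate_statements is a helper invented for this example.

```python
def concatenate_statements(table, partition_key, partition_values):
    """Build one ALTER TABLE ... CONCATENATE statement per partition value."""
    return [
        "ALTER TABLE {} PARTITION ({}='{}') CONCATENATE".format(
            table, partition_key, value)
        for value in partition_values
    ]


# Placeholder table and partition values, for illustration only
for statement in concatenate_statements(
        "app.table_test", "dt", ["2018-09-01", "2018-09-02"]):
    print(statement)
```

Each generated statement can then be run through the Hive client of choice, one partition at a time.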

Read more »

Alex Wong

Regardless of age, everyone is my teacher
